RE: [PHP-I18N] proposal: unification of the grapheme_extract functions

Texin, Tex Sat, 10 May 2008 14:31:10 -0700

Thanks Ed. I remember the discussion now.
Personally I don't think a byte-only start value makes sense.
Byte offsets are an option that should be offered, because they are good for 
performance, but they make for more tedious programming and make it harder to 
migrate programs to use this functionality.


The tradeoff is like this:

Let's say a program is on the third character in a string. Today the program 
knows it is at an index of 3.
In a multibyte world, if the start value is a character count, the first thing 
the function does is scan the string to find the byte offset where the 3rd 
character begins.
However, it is likely that this same byte position is already known from 
immediately prior work on the string. So passing byte offsets around saves 
frequent rescanning of the string.
An important caveat is that if the string is modified, the byte offsets have to 
be thrown away, or at least those past the point of modification.
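
To make the rescanning cost concrete, here is a minimal sketch in plain PHP. 
utf8_advance() is a hypothetical helper, not an intl function, and it counts 
UTF-8 code points rather than grapheme clusters to keep the example short. It 
assumes valid UTF-8 input.

```php
<?php
// Hypothetical helper: walk forward $chars code points from byte offset
// $from and return the byte offset reached. Assumes $s is valid UTF-8 and
// $from sits on a character boundary.
function utf8_advance(string $s, int $chars, int $from = 0): int
{
    $n = strlen($s);
    $i = $from;
    while ($chars > 0 && $i < $n) {
        $i++;                                        // step past the lead byte
        while ($i < $n && (ord($s[$i]) & 0xC0) === 0x80) {
            $i++;                                    // skip continuation bytes
        }
        $chars--;
    }
    return $i;
}

// U+00E4, U+00DF and U+00F6 each take two bytes in UTF-8.
$mystr = "h\u{E4}\u{DF}lich \u{F6}de";

// Character-count start values: every lookup scans from the beginning.
$at3 = utf8_advance($mystr, 3);         // byte offset of character index 3
$at7 = utf8_advance($mystr, 7);         // rescans characters 0..6

// Byte-offset start values: resume from the position already known.
$at7b = utf8_advance($mystr, 4, $at3);  // scans only 4 characters
```

Both routes land on the same byte offset; the second simply does less work, 
and that saving is lost as soon as the string is modified before $at3.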

On the other hand, most existing code is doing character count arithmetic and 
changing it means replacing simple indexing with functions to get byte offsets.
It is harder to convert the code. It is of course possible to make the intl 
extension much smarter and remember index to byte mappings, but we didn't have 
time in the initial version.

$start = 3;
//does stuff at 3 and then wants to do stuff 4 characters after this position.
$extractbegin = $start + 4;
$ext = substr( $mystr, $extractbegin, $len);

Becomes code that has to:
Call a function to find the byte offset of character 3 in the string (by 
scanning).
Keep 2 variables to remember both the current character count and byte count.
Call a function to find the byte offset of character 7, either by scanning 
from the beginning of the string or by starting from the known offset of 
character 3.
$ext=graphemeextract....
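
As a rough illustration of the byte-offset style, here is one way it could be 
written against grapheme_extract() as documented for the intl extension, using 
its by-reference $next argument to carry the byte position forward. 
extract_after() is a hypothetical helper name, and the sketch leaves out error 
handling (for example, running off the end of the string).

```php
<?php
// Sketch only; requires the intl extension. extract_after() is a
// hypothetical helper, not part of intl.
// $byteOffset is a byte position already known from earlier work on $s.
// Returns [extracted text, byte offset just past it] so the caller can keep
// working in byte offsets without rescanning.
function extract_after(string $s, int $byteOffset, int $skip, int $len): array
{
    // Step over $skip graphemes; $next receives the byte offset after them.
    grapheme_extract($s, $skip, GRAPHEME_EXTR_COUNT, $byteOffset, $next);

    // Extract $len graphemes starting at that byte offset.
    $ext = grapheme_extract($s, $len, GRAPHEME_EXTR_COUNT, $next, $after);

    return [$ext, $after];
}
```

The caller now tracks a byte offset next to (or instead of) the character 
count, which is the extra bookkeeping described above.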

My preference is for start to optionally be a grapheme or character count, so 
that migration is quick, and then to add optimizations to the extension to 
recognize strings that are ASCII, cache recently used offsets, etc.
But that's just me...
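
A sketch of what those two optimizations could look like, written in userland 
PHP rather than inside the extension. CharIndex is a hypothetical class, not 
intl code. It counts UTF-8 code points (not grapheme clusters), assumes valid 
UTF-8 and non-negative indexes, and assumes the string is never modified after 
construction.

```php
<?php
// Hypothetical sketch: character-indexed access with an ASCII fast path and
// a one-entry cache of the most recently resolved offset.
final class CharIndex
{
    private string $s;
    private bool $ascii;
    private int $lastChar = 0;   // last character index resolved
    private int $lastByte = 0;   // byte offset of that character

    public function __construct(string $s)
    {
        $this->s = $s;
        // No byte >= 0x80 means every character is exactly one byte.
        $this->ascii = preg_match('/[\x80-\xFF]/', $s) === 0;
    }

    public function byteOffset(int $char): int
    {
        if ($this->ascii) {
            return min($char, strlen($this->s));
        }
        // Resume from the cached position when moving forward, else rescan.
        [$c, $i] = $char >= $this->lastChar
            ? [$this->lastChar, $this->lastByte]
            : [0, 0];
        $n = strlen($this->s);
        while ($c < $char && $i < $n) {
            $i++;                                    // step past the lead byte
            while ($i < $n && (ord($this->s[$i]) & 0xC0) === 0x80) {
                $i++;                                // skip continuation bytes
            }
            $c++;
        }
        [$this->lastChar, $this->lastByte] = [$c, $i];
        return $i;
    }

    // Character-indexed substring, so callers keep their index arithmetic.
    public function substr(int $start, int $len): string
    {
        $from = $this->byteOffset($start);
        $to = $this->byteOffset($start + $len);
        return substr($this->s, $from, $to - $from);
    }
}

$idx = new CharIndex("h\u{E4}\u{DF}lich \u{F6}de");
$start = 3;
$ext = $idx->substr($start + 4, 3);   // same arithmetic as the original code
```

The calling code keeps its $start + 4 arithmetic unchanged, and forward-moving 
lookups avoid rescanning from the beginning of the string.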

For most programs the performance gain from using byte offsets is 
countered by the extra function calls etc., especially for typically short 
strings.
(Scanning large buffers repeatedly for offsets into the last few characters can 
hurt, but can usually be worked around through other optimizations.)

And making the migration difficult will reduce the number of programs that 
actually support languages that need graphemes...

This wasn't your decision so no reflection on you of course. Next version 
should add in support for start values to be grapheme counts....
tex


> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, May 08, 2008 12:42 PM
> To: Texin, Tex; php-i18n@lists.php.net
> Subject: RE: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> 
> >  If I use GRAPHEME_EXTR_MAXBYTES, does it return ...
> 
> > I assume it is the max # of whole graphemes that do not exceed the max 
> > bytes.
> 
> Yes. It works just like the old grapheme_extractb.
> 
> > Also, the $start value is that in byte, character or grapheme units 
> > for each of the types?
> 
> The start value is always bytes. I was unsure if this made 
> sense, really, but it is consistent (and easy to implement).
> 
> =Ed
> 
> 
> 

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
