Thanks Ed. I remember the discussion now. Personally, I don't think it makes sense for the start value to always be bytes. Byte offsets are an option that should be offered, because they are good for performance, but they make the programming more tedious and make it harder to migrate programs to use this functionality.

The tradeoff is like this: Let's say a program is on the third character in a string. Today the program knows it is at an index of 3. In a multibyte world, if the start value is a character count, the first thing the function does is scan the string to find the byte offset where the 3rd character begins. However, it is likely that this same byte position is already known from immediately prior work on the string. So passing byte offsets around saves frequent rescanning of the string. An important caveat is that if the string is modified, the byte offsets have to be thrown away, or at least those after the point where the string is modified.

On the other hand, most existing code is doing character count arithmetic, and changing it means replacing simple indexing with function calls to get byte offsets. That makes the code harder to convert. It is of course possible to make the intl extension much smarter and have it remember index-to-byte mappings, but we didn't have time in the initial version.

    $start = 3;
    // does stuff at 3 and then wants to do stuff 4 characters after this position
    $extractbegin = $start + 4;
    $ext = substr($mystr, $extractbegin, $len);

becomes code that has to:

- Call a function to find the byte offset of character 3 in the string (by scanning).
- Keep 2 variables to remember both the current character count and the byte count.
- Call a function to find the byte offset of character 7, either by scanning from the beginning of the string or by starting from the known offset of character 3.

    $ext=graphemeextract....

My preference is for start to optionally be a grapheme or character count, so the migration can be quick, and then to add optimizations into the extension to recognize strings that are ASCII, cache recently used offsets, etc. But that's just me... For most programs the performance gain from using byte offsets is countered by the extra function calls etc., especially for the typically short strings.
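
To make the bookkeeping above concrete, here is a rough sketch of the two styles. It is only an illustration under some assumptions: utf8_advance() is a hypothetical helper I made up for this mail, not part of the intl extension; it counts UTF-8 code points rather than true graphemes to keep it short; and the sample string and $len are invented.

```php
<?php
// Hypothetical helper (NOT an intl function): starting at byte offset $from,
// step forward $count UTF-8 code points and return the resulting byte offset.
// Assumes $s is valid UTF-8 and $from sits on a code point boundary.
function utf8_advance(string $s, int $count, int $from = 0): int
{
    $len = strlen($s);
    $pos = $from;
    while ($count > 0 && $pos < $len) {
        $pos++; // step past the lead byte
        // skip any continuation bytes (bit pattern 10xxxxxx)
        while ($pos < $len && (ord($s[$pos]) & 0xC0) === 0x80) {
            $pos++;
        }
        $count--;
    }
    return $pos;
}

// Invented sample: "naive cafe deja vu" with accents, written with escapes
// so the bytes do not depend on the file encoding.
$mystr = "na\u{EF}ve caf\u{E9} d\u{E9}j\u{E0} vu";
$len   = 4; // number of characters to extract

// Character-count style: simple arithmetic, but every lookup rescans
// the string from the beginning.
$start        = 3;
$extractbegin = $start + 4;
$beginByte    = utf8_advance($mystr, $extractbegin); // scans 7 characters

// Byte-offset style: two variables per position (character count and byte
// offset), but the second lookup resumes from the known offset.
$startByte  = utf8_advance($mystr, $start);        // scans 3 characters
$beginByte2 = utf8_advance($mystr, 4, $startByte); // scans only 4 more

// Either way, the extraction itself works on byte offsets.
$endByte = utf8_advance($mystr, $len, $beginByte);
$ext     = substr($mystr, $beginByte, $endByte - $beginByte);
```

Both styles land on the same byte offset; the difference is how much of the string gets rescanned and how many variables the caller has to carry around. Any cached byte offset is only good until the string is modified.
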
(Scanning large buffers repeatedly for offsets into the last few characters can hurt, but can usually be worked around thru other optimizations.) And making the migration difficult will reduce the number of programs that actually support languages that need graphemes...

This wasn't your decision so no reflection on you of course. Next version should add in support for start values to be grapheme counts....

tex

> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 08, 2008 12:42 PM
> To: Texin, Tex; php-i18n@lists.php.net
> Subject: RE: [PHP-I18N] proposal: unification of the
> grapheme_extract functions
>
> > > If I use GRAPHEME_EXTR_MAXBYTES, does it return ...
> >
> > I assume it is the max # of whole graphemes that do not exceed the max
> > bytes.
>
> Yes. It works just like the old grapheme_extractb.
>
> > Also, the $start value is that in byte, character or grapheme units
> > for each of the types?
>
> The start value is always bytes. I was unsure if this made
> sense, really, but it is consistent (and easy to implement).
>
> =Ed
>

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php