Hi, I disagree with case 2 as it is described. You don't want to truncate in the middle of a grapheme, if you in fact have graphemes.
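
To make the objection concrete, here is a minimal sketch, in Python rather than PHP so that it does not depend on the API still under discussion. The helper names (clusters, extract_max_chars, extract_max_bytes) are made up for illustration and are not proposed function names. The grapheme rule used here (a base code point plus any following code points of general category M) is a deliberate simplification and not the full Unicode segmentation rules. It only shows a cut at N characters or N bytes landing inside a grapheme, next to the whole-grapheme alternative.

```python
import unicodedata


def clusters(text):
    """Split text into approximate graphemes.

    ASSUMPTION: a grapheme is a base code point plus any following marks
    (general category Mn, Mc, Me). Real segmentation has more rules.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)
    return out


def _take_whole(text, limit, size):
    """Take whole clusters while the running size stays within limit."""
    taken, used = [], 0
    for cluster in clusters(text):
        cost = size(cluster)
        if used + cost > limit:
            break
        taken.append(cluster)
        used += cost
    return "".join(taken)


def extract_max_chars(text, max_chars):
    """Limit counted in characters (code points), whole graphemes only."""
    return _take_whole(text, max_chars, len)


def extract_max_bytes(text, max_bytes, encoding="utf-8"):
    """Limit counted in encoded bytes, whole graphemes only."""
    return _take_whole(text, max_bytes, lambda c: len(c.encode(encoding)))


# "cafes" with the accent stored as a separate combining mark (U+0301).
text = "cafe\u0301s"

naive_chars = text[:4]                  # cuts between "e" and its accent
naive_bytes = text.encode("utf-8")[:5]  # cuts inside the 2-byte accent

whole_chars = extract_max_chars(text, 4)
whole_bytes = extract_max_bytes(text, 5)
```

With a limit of 4 characters the plain slice returns "cafe", so the accent is silently lost and the text now says something different, while the whole-grapheme version stops one unit earlier and returns "caf". The naive 5-byte cut is worse still: it ends in half of a UTF-8 sequence.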

Basically and ideally, there should be only 3 use cases:

A) You are working with graphemes, and ideally you would program with grapheme indexes and counts (start and length in grapheme units).
B) Because of fixed-width buffers, you need to specify a max length in bytes, but the function should only extract whole graphemes (start in graphemes, length in bytes).
C) You are not working with text but with bytes (which really shouldn't be in this discussion, but for completeness...), and so start and length are in bytes.

But we don't live in an ideal world. Grapheme-based processing is more expensive than character processing, and character processing is more expensive than byte processing. If this weren't so, we would only have to deal with A, B, and C, and programming would be simpler.

It is a bad assumption for i18n, but if you are not dealing with Indic or Middle Eastern languages, then you know you have characters, not graphemes, and so why pay the cost? Also, graphemes are newly supported, so existing code is character based. Therefore we need to offer the character support. Analogous to A and B, we need CA (start and length in character units) and CB (start in character units and length in bytes).

But again performance rears its ugly head. Having start be in character or grapheme units means the function always scans through start number of units to find the beginning offset. Hence the desire to offer start position in bytes, giving us a version of A and CA that starts with bytes, and a version of B and CB that specifies start and length in bytes but returns a whole number of graphemes or characters, as appropriate.

The final ugliness is that we have some of these functions in the plain (or non-mb) flavor and the mb_string flavor. So we could say: for graphemes use grapheme_substr, for characters use the mb functions, and for bytes use the plain functions (or the other way around; I think mb overloads the plain with the character based and provides the byte versions in the mb form... I always have to look to check).

But some of the mb functions are not implemented well, so I don't trust them, which you can chalk up to my personal idiosyncrasy. The more salient point is that it is confusing for people to have to sort through all the function flavors with different names. I would prefer to have the choices in one function with options and an explanation of when to use what, perhaps derived from the above logic. And I would deprecate the related functions in mb and plain.

That said, if this is all that's holding up the release, I would release with the byte start and add the other flavors in the next version. People can always use grapheme_length/mb_length (or whatever it is) to get the starting byte position and perhaps write their own function to calculate the byte start and call the grapheme_substr function. It is a nuisance, but if they understand that, they can migrate easily.

Let's wrap this up.
tex

> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 12, 2008 1:01 PM
> To: 'Stanislav Malyshev'
> Cc: Texin, Tex; php-i18n@lists.php.net
> Subject: RE: [PHP-I18N] proposal: unification of the
> grapheme_extract functions
>
> > Maybe I just misunderstand the use case for the extract function -
> > what it's supposed to do that substr, mb_substr and grapheme_substr
> > can't or do worse?
>
> Tex could probably answer this better than I could, but I'll
> have a go.
>
> Use case 1: You have a buffer that is a fixed number of bytes
> long. You need to fill it up as far as you can with whole
> graphemes. You are probably sending that buffer to another
> API that might not be grapheme - or even Unicode - aware. You
> are in a loop so you are tracking your position in the
> original string. This is how the discussion got started about
> how the 'start' parameter is defined - it isn't clear how the
> position would be tracked.
> I assumed a byte count because the
> user can simply do a strlen on the return string to update
> his position, but Tex thinks this isn't as handy as it should
> be. It depends on the details of the algorithm I guess.
>
> Use case 2: Same as above except in this case it is an Oracle
> database buffer where your columns are defined as being N
> Unicode characters (not bytes or graphemes) long.
>
> Use case 3 (a generalization of use case 1 really): You have
> some code that knows about bytes or Unicode characters but
> nothing about graphemes. You want to update the code so it is
> grapheme aware. You can't completely abandon a byte count or
> character count in the code for some reason, but you want to
> easily update the code to process whole graphemes.
>
>
> =Ed
>
>

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php