RE: [PHP-I18N] proposal: unification of the grapheme_extract functions

Texin, Tex Tue, 13 May 2008 05:38:49 -0700

Hi,

I disagree with case 2 as it is described. You don't want to truncate in the 
middle of a grapheme, if you in fact have graphemes.


Basically and ideally, there should be only 3 use cases:

A) You are working with graphemes, and ideally you would program with grapheme 
indexes and counts (start and length in grapheme units).

B) Because of fixed width buffers, you need to specify a max length in bytes, 
but the function should only extract whole graphemes (start in graphemes, 
length in bytes).

C) You are not working with text but bytes (which really shouldn't be in this 
discussion, but for completeness...) and so start and length in bytes.
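
To make the three cases concrete, here is a rough sketch in Python rather than
PHP. Everything in it is an assumption for illustration: the function names
(clusters, extract_a, extract_b, extract_c) are hypothetical, and "grapheme" is
approximated as a base code point plus any combining marks that follow it,
which is much simpler than the real Unicode segmentation rules.

```python
import unicodedata

def clusters(text):
    """Split text into approximate graphemes: a base code point plus the
    combining marks that follow it. A simplification, not full UAX #29."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch          # attach the mark to the previous base
        else:
            out.append(ch)         # start a new cluster
    return out

def extract_a(text, start, length):
    """Case A: start and length are both in grapheme units."""
    return "".join(clusters(text)[start:start + length])

def extract_b(text, start, max_bytes):
    """Case B: start in graphemes, limit in UTF-8 bytes, whole graphemes only."""
    out = b""
    for g in clusters(text)[start:]:
        enc = g.encode("utf-8")
        if len(out) + len(enc) > max_bytes:
            break                  # the next grapheme would not fit
        out += enc
    return out.decode("utf-8")

def extract_c(data, start, length):
    """Case C: plain bytes, no knowledge of text at all."""
    return data[start:start + length]

sample = "e\u0301a\u0300bc"        # e+acute, a+grave, b, c (8 bytes in UTF-8)
print(extract_a(sample, 1, 2))                  # a+grave followed by b
print(extract_b(sample, 0, 5))                  # only e+acute fits whole
print(extract_c(sample.encode("utf-8"), 0, 5))  # cuts a+grave in half
```

With a 5-byte limit, case B returns only the first grapheme (3 bytes), while
case C returns 5 bytes that end in the middle of the second grapheme. That is
the truncation I am objecting to above.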

But we don't live in an ideal world.
Grapheme-based processing is more expensive than character processing, which in 
turn is more expensive than byte processing.

If this weren't so, we would only have to deal with A, B, and C and programming 
would be simpler.

It is a bad assumption for i18n, but if you are not dealing with Indic or 
Middle Eastern languages, then you know you have characters, not graphemes, so 
why pay the cost?
Also, grapheme support is new, so existing code is character-based.

Therefore we need to offer character support. Analogous to A and B, we 
need CA (start and length in character units) and CB (start in character units 
and length in bytes).
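
A rough sketch of the two character-based variants, in Python for
illustration only. The names extract_ca and extract_cb are hypothetical, a
"character" is taken to be a Unicode code point, and byte lengths assume UTF-8.

```python
def extract_ca(text, start, length):
    """Case CA: start and length are both in character (code point) units."""
    return text[start:start + length]

def extract_cb(text, start, max_bytes):
    """Case CB: start in characters, limit in UTF-8 bytes, whole characters only."""
    out = b""
    for ch in text[start:]:
        enc = ch.encode("utf-8")
        if len(out) + len(enc) > max_bytes:
            break                  # the next character would not fit
        out += enc
    return out.decode("utf-8")

precomposed = "h\u00e9llo"         # e-acute as a single code point
decomposed = "e\u0301a\u0300bc"    # e+acute and a+grave as base + mark

print(extract_ca(precomposed, 1, 2))   # e-acute followed by l
print(extract_cb(precomposed, 0, 2))   # "h": e-acute needs 2 more bytes
print(extract_ca(decomposed, 0, 3))    # e+acute, then a without its grave
```

The last call shows the cost of the shortcut: on decomposed text a
character-based cut can still separate a base letter from its mark, which is
why these variants are only safe when you know you have characters and not
graphemes.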

But again performance rears its ugly head. Having start be in character or 
grapheme units means the function always scans through start number of units to 
find the beginning offset. Hence the desire to offer the start position in 
bytes, giving us a version of A and CA that starts with bytes, and a version of 
B and CB that specifies start and length in bytes but returns a whole number of 
graphemes or characters, as appropriate.
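
The byte-start version of B can be sketched as follows, again in Python and
again under loud assumptions: extract_at and fill_buffers are hypothetical
names, the grapheme rule is the simplified "base plus combining marks" one, and
the caller is trusted to pass a byte_start that sits on a grapheme boundary.
The sketch decodes the whole tail for brevity, which a real implementation
would avoid.

```python
import unicodedata

def clusters(text):
    """Approximate graphemes: a base code point plus following combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def extract_at(data, byte_start, max_bytes):
    """Start and limit in bytes; returns whole graphemes only, as bytes.
    Nothing before byte_start is scanned."""
    out = b""
    for g in clusters(data[byte_start:].decode("utf-8")):
        enc = g.encode("utf-8")
        if len(out) + len(enc) > max_bytes:
            break
        out += enc
    return out

def fill_buffers(text, max_bytes):
    """Walk the string in fixed-size buffers, tracking position in bytes."""
    data = text.encode("utf-8")
    pos, pieces = 0, []
    while pos < len(data):
        piece = extract_at(data, pos, max_bytes)
        if not piece:
            raise ValueError("a single grapheme is larger than the buffer")
        pieces.append(piece)
        pos += len(piece)          # the length of the result is the new offset
    return pieces

print(fill_buffers("e\u0301a\u0300bc", 4))
# [b'e\xcc\x81', b'a\xcc\x80b', b'c']
```

The loop never re-counts graphemes from the beginning of the string: the byte
length of each returned piece is all it needs to find the next start.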

The final ugliness is that we have some of these functions in the plain (or 
non-mb) flavor and the mb_string flavor.
So we could say: for graphemes use grapheme_substr, for characters use the mb 
functions, and for bytes use the plain functions. (Or the other way around: I 
think mb overloads the plain functions with the character-based ones and 
provides the byte versions in the mb form... I always have to look to check.)

But some of the mb functions are not implemented well, so I don't trust them, 
which you can chalk up to my personal idiosyncrasy. The more salient point is 
that it is confusing for people to have to sort through all the function 
flavors with different names. I would prefer to have the choices in one 
function with options and an explanation of when to use what, perhaps derived 
from the above logic. And I would deprecate the related functions in mb and 
plain.

That said, if this is all that's holding up the release, I would release with 
the byte start and add the other flavors in the next version.
People can always use grapheme_length/mb_length (or whatever it is) to get the 
starting byte position and perhaps write their own function to calculate the 
byte start and call the grapheme_substr function.
It is a nuisance, but if they understand that, they can migrate easily.
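
Such a helper might look roughly like this. It is a Python sketch, not PHP;
byte_offset is a hypothetical name, graphemes are approximated as a base code
point plus following combining marks, and byte counts assume UTF-8.

```python
import unicodedata

def clusters(text):
    """Approximate graphemes: a base code point plus following combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def byte_offset(text, grapheme_index):
    """Convert a start position in graphemes to a start position in bytes
    by summing the encoded length of every grapheme before it."""
    return sum(len(g.encode("utf-8")) for g in clusters(text)[:grapheme_index])

sample = "e\u0301a\u0300bc"        # e+acute, a+grave, b, c
start = byte_offset(sample, 2)     # skip the two accented letters
print(start)                                       # 6
print(sample.encode("utf-8")[start:].decode("utf-8"))  # bc
```

The helper pays for the scan once, and the resulting byte offset can then be
handed to a byte-start function as often as needed.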

Let's wrap this up.


tex


> -----Original Message-----
> From: Ed Batutis [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 12, 2008 1:01 PM
> To: 'Stanislav Malyshev'
> Cc: Texin, Tex; php-i18n@lists.php.net
> Subject: RE: [PHP-I18N] proposal: unification of the 
> grapheme_extract functions
> 
> > Maybe I just misunderstand the use case for the extract function - 
> > what it's supposed to do that substr, mb_substr and grapheme_substr 
> > can't or do worse?
> 
> Tex could probably answer this better than I could, but I'll 
> have a go.
> 
> Use case 1: You have a buffer that is a fixed number of bytes 
> long. You need to fill it up as far as you can with whole 
> graphemes. You are probably sending that buffer to another 
> API that might not be grapheme - or even Unicode - aware. You 
> are in a loop so you are tracking your position in the 
> original string. This is how the discussion got started about 
> how the 'start' parameter is defined - it isn't clear how the 
> position would be tracked. I assumed a byte count because the 
> user can simply do a strlen on the return string to update 
> his position, but Tex thinks this isn't as handy as it should 
> be. It depends on the details of the algorithm I guess.
> 
> Use case 2: Same as above except in this case it is an Oracle 
> database buffer where your columns are defined as being N 
> Unicode characters (not bytes or graphemes) long.
> 
> Use case 3 (a generalization of use case 1 really): You have 
> some code that knows about bytes or Unicode characters but 
> nothing about graphemes. You want to update the code so it is 
> grapheme aware. You can't completely abandon a byte count or 
> character count in the code for some reason, but you want to 
> easily update the code to process whole graphemes.
> 
> 
> =Ed
> 
> 
> 

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
