[power-pro] Re: Unicode bugs?

Sheri Sun, 16 Aug 2009 10:47:24 -0700

--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> --- In [email protected], "Sheri" <sherip99@> wrote:
> > > Wouldn't it be simpler just to keep with the current case flags?
> > > Would just be a simple extra test if such a flag found ("is
> > > unicode present? then go thataway").
> > 
> > Yes, I just doubted that the process knows if the "utf8" option is in 
> > effect when processing $u0, etc. 
> 
> I checked that out, possible to propagate "utf8" option through to replacement 
> string parsing.  Already made the change.
> 
> > Thought adding extra triggers might be cleaner. Also that changing $u0 to 
> > always test for utf8 penalizes the usual non-utf8 situation for every match 
> > in a multiple match situation where a case-modifier is in the format 
> > string. Am I too conservative?
> 
> I either got to test for the new option letters (extending a switch 
> statement) or test a simple binary within each branch of existing switch.  
> Not much in it.
> 
> > This regex feature will likely get little usage. Yes, when/if
> > ever needed, it would be very convenient if the format string
> > handled it. When processing utf8 regex at that level, you can be
> > confident that the unicode plugin is running. If existing unicode
> > methods for effecting case changes in utf8 prove to be too
> > inefficient, don't you think the logical place for adding utf8
> > convenience functions would be the unicode plugin?
>  
> Maybe, it's a thought. I'd need to extend the unicode interplugin
> api. There's still the overhead of UTF-8/unicode/UTF-8
> conversion, no matter where it's done.
> 
> Here's a thing.  The existing regex case conversion stuff takes advantage of 
> conversion not affecting string length; no need to allocate space to accept a 
> string possibly bigger than the one one started with.  
> 
> Suppose I have a lower case UTF-8 string. Any way to know if the
> result of conversion to upper case will be the same length? In
> other words, is there any pattern in UTF-8 that says upper/lower
> case forms of same letter take same number of bytes? I'll have a
> look, myself, but if you find any rule on the subject, useful to
> know.


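Probably, in most cases: so long as there is a one-to-one Unicode character case 
mapping between upper and lower, the UTF-8 encodings of the two will usually be 
the same length. It isn't guaranteed, though. UTF-8 length depends only on which 
range the code point falls in, and a few pairs straddle a range boundary (for 
example U+0131, dotless i, is two bytes while its upper case, I, is one).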
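
A rough way to check a given pair, sketched in Python purely for illustration 
(the function names are made up, not anything in the regex or unicode plugin): 
the UTF-8 length of a character follows from its code point range, so two case 
forms are the same length exactly when both fall in the same range.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def same_length_case_pair(a: str, b: str) -> bool:
    """True if the two single characters encode to the same number of bytes."""
    return utf8_len(ord(a)) == utf8_len(ord(b))


# Hypothetical sample pairs: the first two keep their length, the last two do not.
pairs = [("a", "A"), ("\u00e9", "\u00c9"), ("\u0131", "I"), ("\u212a", "k")]
for x, y in pairs:
    print(f"U+{ord(x):04X} / U+{ord(y):04X}: same length = {same_length_case_pair(x, y)}")
```

So an in-place conversion is safe only for pairs where this test passes; 
anything else needs a buffer of a different size.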

<ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt>

You can see some exceptions in the file.

The exceptions are discussed in 
<ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt>. They look to be odd 
cases, and a lot of them are Greek.
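
To get a feel for how much those special cases can change the byte count, here 
is a small Python sketch. It leans on Python's own str.upper(), which applies 
the full (possibly one-to-many) mappings; that is an assumption about Python, 
not about how the plugin does it, and upper_byte_delta is a made-up helper name.

```python
def upper_byte_delta(s: str) -> int:
    """Change in UTF-8 byte count when s is upper-cased (0 means an in-place
    conversion would fit, positive means a larger buffer is needed)."""
    return len(s.upper().encode("utf-8")) - len(s.encode("utf-8"))


# Hypothetical samples: plain ASCII, sharp s, the fi ligature, and
# Greek small iota with dialytika and tonos (U+0390).
for sample in ["abc", "\u00df", "\ufb01", "\u0390"]:
    print(f"{sample!r}: {upper_byte_delta(sample):+d} bytes")
```

A nonzero result in either direction means the "same length" shortcut does not 
hold for that string.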

Following is the full unicode data base.

<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>

Columns 12, 13 and 14 (counting the code point itself as column 0) are the upper 
case, lower case and title case mappings. Upper or lower can be omitted if the 
same as the code point itself; title can be omitted if the same as upper. The 
full database is only supposed to have something where there is a one-to-one 
mapping. Yet I see that Latin capital I supposedly has complications according 
to SpecialCasing.txt.
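
If it helps, here is a rough Python sketch of scanning records in that layout 
for one-to-one mappings whose UTF-8 length differs from the original. The 
sample records are typed in by hand in the file's semicolon-separated layout 
(only columns 0, 12, 13 and 14 matter here, so check the rest against the real 
file), and length_changing_mappings is a made-up name.

```python
# Hand-written sample records in the UnicodeData.txt layout (an assumption,
# not a verbatim extract of the file).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
212A;KELVIN SIGN;Lu;0;L;004B;;;;N;DEGREES KELVIN;;;006B;
"""

# Column index (0-based) of each simple case mapping.
CASE_COLUMNS = (("upper", 12), ("lower", 13), ("title", 14))


def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(cp).encode("utf-8"))


def length_changing_mappings(text: str):
    """Return (code point, kind, target) for mappings that change UTF-8 length."""
    found = []
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        cp = int(fields[0], 16)
        for kind, col in CASE_COLUMNS:
            if fields[col]:  # an empty column means "no mapping listed"
                target = int(fields[col], 16)
                if utf8_len(cp) != utf8_len(target):
                    found.append((cp, kind, target))
    return found


for cp, kind, target in length_changing_mappings(SAMPLE):
    print(f"U+{cp:04X} -> {kind} U+{target:04X}")
```

Run over the whole file, this would list every one-to-one mapping where an 
in-place conversion would not fit.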

Also this page may have potentially useful info:
<http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/>

Regards,
Sheri
