Rasmus:

At 6:54 PM -0700 6/5/06, Rasmus Lerdorf wrote:
>tedd wrote:
>>For example, the Unicode issue was raised during this discussion -- if php 
>>doesn't consider the numeric relationship of characters, then I see a big 
>>problem waiting in the wings. Because if we're having these types of 
>>discussions with just considering 00-7F characters, then I can only guess at 
>>what's going to happen when we start considering 000000-10FFFF code-points.
>>
>>Now, was that enough said?  :-)
>
>I don't think you really understand this.  < and > are collation operators 
>when they operate on strings.  They have absolutely nothing to do with the 
>numeric values of the characters.

What's to understand?  It's the pecking order of strings -- it's the system of 
how one sorts strings. It's the way I tried to order my books in college. We've 
been doing it all our lives and now you think I don't understand how to sort 
stuff? It's not the "white and colored" clothes thing my wife keeps talking 
about, is it?

Look, I understand collation. I also understand that collation is different for 
different languages and for different needs. In some cases, greatly different.

The point of this discussion was how php collates/sorts or otherwise orders 
characters/strings when given the operation to increment from "a" to "z".

As this thread has demonstrated, there's a wide range of expectations as to how 
that should happen.
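
To make the surprise concrete, here is a rough sketch in Python (not php) that 
emulates the behavior discussed in this thread. The increment() helper is my 
own hypothetical model of a Perl-style string increment limited to lowercase 
a-z, and the loop condition is a plain code point string comparison. Treat it 
as an illustration under those assumptions, not as php's actual implementation.

```python
def increment(s):
    """Assumed model of a Perl-style increment on a lowercase a-z string."""
    chars = list(s)
    i = len(chars) - 1
    while i >= 0:
        if chars[i] != "z":
            chars[i] = chr(ord(chars[i]) + 1)
            return "".join(chars)
        chars[i] = "a"  # wrap this position and carry to the left
        i -= 1
    return "a" + "".join(chars)  # carried past the first character: "z" -> "aa"


def loop_a_to_z():
    """Collect every value visited by a loop of the form: from "a" while <= "z"."""
    out = []
    s = "a"
    while s <= "z":  # string comparison, code point by code point
        out.append(s)
        s = increment(s)
    return out


values = loop_a_to_z()
print(len(values), values[25], values[26], values[-1])  # 676 z aa yz
```

Under this model the loop does not stop at "z": the next value is "aa", which 
still compares as less than "z", so it keeps going until "yz".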

The reference you provided touches upon some of the problems that collation 
faces when trying to develop collation systems for different needs. This 
discussion was no different.

>It just so happens that in English iso-8859-1 there is a 1:1 relationship 
>between the numeric values and the collation order, but you can think of that 
>as dumb luck.

No, English iso-8859-1 was designed to conform to the ASCII standard -- the 
same as Unicode and other standards that followed. It's not dumb luck to make 
"standards" backward compatible, it's by good design.

Considering that you're dealing with English iso-8859 and ASCII (the 
"American" Standard Code for Information Interchange), then I think the 
connection between numeric values and collation went hand-in-hand by design. 
It was not by accident.

It's just too bad that the "powers-that-be" at the time didn't realize that 
7-bit wouldn't cover everything to come in the near future.

>To better understand this, I suggest you start reading here:
>
>  http://icu.sourceforge.net/userguide/Collate_Intro.html
>
>Note one of the points on that page.  That in Lithuanian 'y' falls between 'i' 
>and 'k'.  So even without going into Unicode and just using low-ascii, you 
>have these issues.

I don't have these issues because I'm not Lithuanian. If a Lithuanian php 
programmer wants "y" to fall between "i" and "k" in a loop, then good luck -- 
for I can't get it to stop when it passes "z" -- which I think it should.

But, as far as I am aware, there is no low-ASCII, there is no high-ASCII, 
there is simply ASCII. While it is common to use the term extended-ASCII, it's 
a misnomer because the American Standards Association had nothing to do with 
establishing/defining any character above DEC 127.

The above example you referenced is simply one of many and demonstrates that 
the "collation" problem is very complex. You should look into how Unicode 
performs canonical ordering in combining characters such as using an accent, 
umlaut, or cedilla as well as how that combination affects collation in 
different languages as stated in your reference. You will see that the 
canonical ordering algorithm is numeric.
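
As a small illustration of that last point, here is a sketch in Python using 
its standard unicodedata module. It takes two combining marks typed in one 
order and shows normalization re-ordering them by their numeric canonical 
combining class. The choice of "q" and of these two marks is only my example.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW

# Each combining mark carries a numeric canonical combining class.
classes = (unicodedata.combining(DOT_ABOVE), unicodedata.combining(DOT_BELOW))

# Typed as: base letter, dot above, dot below.
original = "q" + DOT_ABOVE + DOT_BELOW

# Canonical ordering sorts adjacent marks by ascending combining class.
reordered = unicodedata.normalize("NFD", original)

print(classes)                               # (230, 220)
print([hex(ord(ch)) for ch in reordered])    # ['0x71', '0x323', '0x307']
```

The mark with the lower class number (220) ends up first, regardless of the 
order in which the marks were entered.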

Yes, I'm very aware of Unicode. I'm aware enough to know that they have 
assigned numerical equivalents to every glyph known to man including those 
combining glyphs such as those mentioned above to produce combination 
characters. When I say every "glyph known to man", that includes much more than 
language specific glyphs.

I'm also aware of IDNS and how they implement Unicode, which is not inclusive. 
Take, for example, case mapping, in which IDNS simply translates all of what 
they perceive to be uppercase to lowercase. Some characters are combination 
characters when lowercase and a single character when uppercase, thus there is 
no lowercase representation for the uppercase character. Oops, I just lost the 
16th century (Roller Ball).
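
A quick sketch of the kind of case-mapping loss I mean, using Python's 
built-in Unicode case mapping. This is not the IDNS mapping itself, just an 
assumed stand-in that shows the same sort of problem with one character I 
picked as an example.

```python
capital_i_dot = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Lowercasing yields two code points: "i" followed by COMBINING DOT ABOVE.
lowered = capital_i_dot.lower()

# Uppercasing that result does not give back the original single character.
round_trip = lowered.upper()

print(len(capital_i_dot), len(lowered))   # 1 2
print(round_trip == capital_i_dot)        # False
```

Once everything is folded to lowercase, the original single uppercase 
character can't be recovered by simply mapping back.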

Now, I can appreciate the problems facing php considering that it has to deal 
with not only Unicode, but also with the IDNS when addressing Unicode and 
the Internet. But that problem is not going to be solved by ignoring that 
Unicode code-points have numeric (and other) values. I would think that serious 
collation systems use numeric values in some fashion in their algorithms -- 
don't they? If not, please explain how they detect differences between 
characters and group them into collation tables.
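
Here is a toy sketch, in Python, of what I have in mind: each character is 
looked up in a table to get a numeric weight, and strings are compared by 
those weights rather than by raw code points. The tailored alphabet below is 
hypothetical and heavily simplified (it only moves "y" to sit after "i", as in 
the Lithuanian example from your reference); it is not the ICU algorithm.

```python
# Hypothetical, simplified tailoring: "y" sorts right after "i".
TAILORED_ALPHABET = "abcdefghiyjklmnopqrstuvwxz"

# Collation table: character -> numeric weight (its rank in the alphabet).
WEIGHT = {ch: rank for rank, ch in enumerate(TAILORED_ALPHABET)}


def sort_key(word):
    """Turn a word into a list of numeric weights for comparison."""
    return [WEIGHT[ch] for ch in word]


words = ["kava", "yra", "ir", "jis"]

by_code_point = sorted(words)                # raw code point order
by_tailoring = sorted(words, key=sort_key)   # table-driven weight order

print(by_code_point)  # ['ir', 'jis', 'kava', 'yra']
print(by_tailoring)   # ['ir', 'yra', 'jis', 'kava']
```

Either way the comparison is numeric. The only difference is whether the 
numbers are the code points themselves or weights looked up in a table.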

>Now, until we get to PHP 6, we don't have decent Unicode support and we don't 
>have LOCALE-aware operators.  You will have to manually use strcoll() to get 
>them, but that is going to change and you will have the ICU collation 
>algorithms available and for Unicode strings it will be automatic.  You can 
>still have binary-strings if you don't want locale-aware collation, of course.

Well, good luck with that.

tedd
-- 
------------------------------------------------------------------------------------
http://sperling.com  http://ancientstones.com  http://earthstones.com

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
