Re: [The Java Posse] Dick, that's not how you compare strings!

Alexey Zinger Thu, 05 Aug 2010 16:48:17 -0700

Very interesting read, but I gotta ask about toUpperCase(Locale).  You've
soundly debunked the idea of using toLowerCase(Locale) as a reliable
internationalizable way of comparing strings irrespective of capitalization.
Am I wrong in considering toUpperCase(Locale) as a more suitable alternative?
It seems your examples would lead one to believe so.

 Alexey

________________________________
From: Reinier Zwitserloot <[email protected]>
To: The Java Posse <[email protected]>
Sent: Wed, August 4, 2010 10:07:54 AM
Subject: [The Java Posse] Dick, that's not how you compare strings!

This is a long post that is mostly a fun walk through a few different
languages used on planet Earth, showing why the entire concept of
'upper' and 'lower' case is mostly broken if one subscribes to the
theory that English isn't the only language around. The summary, in
case you don't want to read it all, boils down to this: (A) Use
String.CASE_INSENSITIVE_ORDER, it's better than toLowerCase()ing both
arguments, and (B) actually, don't use that either, use
java.text.Collator.

But, if you have a casual interest in learning something, by all means
read on!

From one of the many Scala threads: Dick uses .toLowerCase() as a
placeholder for case-insensitive string comparison, arguing that for
example this:

list.sort(_.toLowerCase);

is much simpler than the Java alternative, which in Dick's example is
an anonymous inner class literal that supplies an implementation of
Comparator.

That's a flawed example. The right way to do this in Java would be:

Collections.sort(list, String.CASE_INSENSITIVE_ORDER);

but it gets much worse, because this just isn't how one should ever
compare strings. I'll give you three examples:

EXAMPLE 1: The Eszett character.

In German, a certain ligature for two consecutive s characters has
been used so much for so long that it's considered its own unique
character. The Eszett or 'scharfes S': ß. It shows up all over the
place in German, for example in the word for street, which is Straße.
The Eszett is unusual in that it has no capital form.
Words don't start with double s so there has never been a reason to
capitalize it. Instead, if you ask a German to write street in all
caps, they'd write "STRASSE".

Uh-oh. That's not a reversible operation! The lowercase form of
"STRASSE" is "strasse", and even knowing that the locale is German,
you can't just assume that double-s *ALWAYS* becomes the ß character.
You'd need a dictionary. Arguably you're better off doing that
conversion than not doing it if you know it's German (a match is more
likely), but it won't be a sure thing. Nevertheless, checking whether
"straße" and "STRASSE" are equal under case-insensitive comparison
SHOULD return true, not false. That's achievable in principle, but
using .toLowerCase() as a proxy for case-insensitive comparison is
never going to get us there.

Java screws this up, though. Java has chosen to adopt the Unicode
invention of the capital Eszett, which nobody in Germany uses and is
clearly a silly idea: You don't hack a living, breathing language to
get around i18n problems. If that's feasible we might as well tell the
world to stop whining and only use English when using machines. Even
more dire, String.CASE_INSENSITIVE_ORDER does not in fact return
'0' (indicating equality) when comparing straße and STRASSE, even
though every single language that uses the ß (German. That's it!) says
those two are case-insensitive equal. Nevertheless, the fact that Java
has a bug here does not excuse the brokenness of the 'toLowerCase() is
an alternative for case-insensitive comparison' meme.
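
The round trip described in this example can be checked directly. What
follows is a minimal sketch, not anything from the original thread: the
class name EszettDemo and the helper naiveEquals are made up for
illustration, and Locale.ROOT is passed explicitly so the result does
not depend on the machine's default locale. The ß is written as the
escape \u00df so the file's encoding doesn't matter.

```java
import java.util.Locale;

public class EszettDemo {
    // "\u00df" is the Eszett (sharp s).
    static final String LOWER = "stra\u00dfe";
    static final String UPPER = "STRASSE";

    // The approach criticized above: lowercase both sides, then compare.
    // (Hypothetical helper, named here only for the demonstration.)
    static boolean naiveEquals(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Uppercasing expands the single character into two letters...
        System.out.println(LOWER.toUpperCase(Locale.ROOT)); // STRASSE
        // ...and lowercasing cannot know it should contract them again.
        System.out.println(UPPER.toLowerCase(Locale.ROOT)); // strasse

        // So none of the usual "case insensitive" tools treat the pair as equal.
        System.out.println(naiveEquals(LOWER, UPPER));       // false
        System.out.println(LOWER.equalsIgnoreCase(UPPER));   // false
        System.out.println(
            String.CASE_INSENSITIVE_ORDER.compare(LOWER, UPPER) == 0); // false
    }
}
```

Note that the two strings differ in length (6 versus 7 characters),
which is why any character-by-character case folding is bound to fail
here.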

EXAMPLE 2: The Turkish dotless/dotted i and the somewhat famous "PHP
is completely broken on Turkish machines" problem.

The ß vs. SS issue indicates that toLowerCase() isn't a valid
replacement for case-insensitive comparison, but it gets worse. You
can't, in fact, say anything whatsoever about casing, or case-
insensitive equality, without knowing the locale you're operating in.
In Turkish there's not one 'i' but two: The dotted i and the dotless
i. Where this gets crazy is the capital forms of those. The capital
form of 'i' is a _dotted_ capital I, and the capital form of dotless
'i' is our normal, familiar capital I, which doesn't have a dot in it.
This even messes with typography (fi is a common ligature where the tip
of the curvy top of the f is joined with the i's dot. That's not
appropriate in Turkish, where that dot is important). This means "i"
and "I" are not equal in case-insensitive comparison in the Turkish
locale. .toLowerCase() comparison gets this wrong, of course.
toLowerCase(new Locale("tr")) would actually get this right.
String.CASE_INSENSITIVE_ORDER gets this wrong, because you can't give
it a locale. For natural language comparison (and why are you upper/
lowering strings in the first place, if not because you're doing
natural language comparison?), the locale-less tLC() and tUC(), and
String.CASE_INSENSITIVE_ORDER, are all inherently broken. Java gets
this a little right and offers Locale-based variants of tLC() and
tUC(), as I hinted at earlier.
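
To make the Turkish mapping concrete, here is a small sketch of what the
locale-aware variants return. The class name TurkishCaseDemo is an
illustrative assumption; Locale.forLanguageTag("tr") is used as one way
to obtain the Turkish locale. The dotless ı is \u0131 and the dotted
capital İ is \u0130.

```java
import java.util.Locale;

public class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // Turkish rules: capital I lowercases to dotless \u0131,
        // and small i uppercases to dotted capital \u0130.
        System.out.println("I".toLowerCase(TURKISH).equals("\u0131")); // true
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130")); // true

        // Locale-neutral rules: the familiar ASCII pairing.
        System.out.println("I".toLowerCase(Locale.ROOT).equals("i"));  // true

        // Under Turkish rules "i" and "I" therefore do not fold together.
        System.out.println(
            "i".toLowerCase(TURKISH).equals("I".toLowerCase(TURKISH))); // false

        // String.CASE_INSENSITIVE_ORDER takes no locale, so it still
        // reports the pair as equal.
        System.out.println(
            String.CASE_INSENSITIVE_ORDER.compare("i", "I") == 0);      // true
    }
}
```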

That PHP we always like to bash on? One of the things PHP does (or
perhaps did, I don't keep up with it) wrong for the longest of times
was that it would completely lose its marbles if the system's locale
was set to Turkish. It would just fail to work. PHP identifiers are
defined as case insensitive, and PHP implemented this by
toLowerCase()ing everything, using the platform default locale. This
turns "FILE" into "fıle" which is not equal to "file", and thus
running most PHP code on a machine configured with a Turkish locale
breaks, if the PHP is in English. This is a famous example of the
universally lamented "it works for me" attitude. So, yes, making this
mistake has dire consequences.
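
The same failure mode is easy to reproduce in Java. The sketch below is
an illustration only and says nothing about how PHP is actually
implemented: the class IdentifierFolding and both method names are
hypothetical. It folds an identifier once with a caller-supplied locale
(standing in for "whatever the platform default happens to be") and
once with Locale.ROOT, which is the usual choice for machine-readable
text as opposed to natural language.

```java
import java.util.Locale;

public class IdentifierFolding {
    // Folds with an arbitrary locale, mimicking code that silently
    // depends on the platform default.
    static String foldWith(String identifier, Locale locale) {
        return identifier.toLowerCase(locale);
    }

    // Folds with locale-neutral rules, so the result is the same on
    // every machine.
    static String foldNeutral(String identifier) {
        return identifier.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // On a Turkish machine, "FILE" becomes "f\u0131le" (dotless i)
        // and no longer matches the keyword "file".
        System.out.println(foldWith("FILE", turkish).equals("file")); // false

        // The neutral fold matches everywhere.
        System.out.println(foldNeutral("FILE").equals("file"));       // true
    }
}
```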

EXAMPLE 3: ascii hacks.

To make matters worse still, let's think about why one would want to do
a case-insensitive comparison in the first place. Presumably because
there's some user input of some sort that needs to be compared, and
you don't want to bother the user with case sensitivity. However, if
that's the aim, "case insensitive" is entirely the wrong idea. There
are a bajillion systems around that can't deal with Unicode. For
example, if many 'name' forms don't even accept a dash in your last
name, how many do you think accept a "ü" in it? And yet plenty of
German folks are called "Müller". The 'fix' these poor saps have used
for years and years is to write the canonical ASCII alternative for
their funky character. ß becomes ss, ü becomes ue, etcetera. I'm
betting that if you intend for "JOE" and "joe" to be equal, then you
should consider "müller" and "mueller" equal too.

What's really needed is a human-inputted string comparator. Optimally
speaking such a tool will first canonicalize each string, turning for
example dotless i of any capitalization into a dotted lowercase i,
ß into ss, ü into ue, etcetera, and only then compare the
strings. I'm not even sure this can be done properly without knowing a
locale, but you could do a lot better than
String.CASE_INSENSITIVE_ORDER that way, and far better still versus
comparing the .toLowerCase() versions of any two given strings.
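
A canonicalize-then-compare step along those lines might be sketched as
follows. Everything here is an assumption made for illustration: the
class name HumanInputKey, the method names, and above all the
replacement table, which hard-codes a handful of German and Turkish
conventions and is nowhere near complete. Non-ASCII characters are
written as escapes (\u00fc is ü, \u00df is ß, \u0131 is dotless ı,
\u0130 is dotted capital İ).

```java
public class HumanInputKey {
    // Maps one character to its canonical ASCII spelling.
    // The table is a tiny, hand-picked sample, not a real transliteration.
    static String canonical(char c) {
        switch (c) {
            case '\u00df': case '\u1e9e': return "ss"; // sharp s, either case
            case '\u00e4': case '\u00c4': return "ae"; // a-umlaut
            case '\u00f6': case '\u00d6': return "oe"; // o-umlaut
            case '\u00fc': case '\u00dc': return "ue"; // u-umlaut
            case '\u0131': case '\u0130': return "i";  // dotless i, dotted I
            default:
                // Character.toLowerCase(char) does not consult any locale.
                return String.valueOf(Character.toLowerCase(c));
        }
    }

    // Builds the comparison key for a whole string.
    static String canonicalize(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            sb.append(canonical(s.charAt(i)));
        }
        return sb.toString();
    }

    // Two inputs match when their keys are identical.
    static boolean sameInput(String a, String b) {
        return canonicalize(a).equals(canonicalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameInput("M\u00fcller", "MUELLER")); // true
        System.out.println(sameInput("stra\u00dfe", "STRASSE")); // true
        System.out.println(sameInput("JOE", "joe"));             // true
    }
}
```

The obvious weakness is that the table bakes in one language's
conventions (ü as ue is German practice, not universal), which is the
"can this be done without knowing a locale" doubt raised above.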

Java does in fact have something for this: java.text.Collator. This
indeed does more or less what I just described, and it is in fact what
one ought to be using. If I were you, I'd make a PMD plugin right now
that checks for usage of locale-less tLC and tUC, as well as
String.CASE_INSENSITIVE_ORDER, and flags 'em all as warnings.
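
For reference, a minimal sketch of using java.text.Collator as a
case-insensitive comparator. The class name CollatorDemo and its helper
methods are made up for the example. Setting the strength to
Collator.PRIMARY asks the collator to ignore case and accent
differences; how a given collator treats ß, or ü versus ue, depends on
the collation rules shipped for that locale, so the sketch prints those
results rather than promising them.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorDemo {
    // A collator that only looks at base letters for the given locale.
    static Collator baseLetters(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    static boolean same(String a, String b, Locale locale) {
        return baseLetters(locale).compare(a, b) == 0;
    }

    // Collator implements Comparator<Object>, so it can be handed to sort.
    static List<String> sorted(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(baseLetters(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(same("JOE", "joe", Locale.ENGLISH)); // true

        // Locale-dependent results: printed, not assumed.
        System.out.println(same("stra\u00dfe", "STRASSE", Locale.GERMAN));
        System.out.println(same("M\u00fcller", "Mueller", Locale.GERMAN));

        System.out.println(
            sorted(Arrays.asList("banana", "Apple", "cherry"), Locale.ENGLISH));
        // [Apple, banana, cherry]
    }
}
```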

BONUS EXAMPLE: Lower case. Upper case. And Title case??

Astute folks may have observed that Character.toTitleCase() exists.
What's that, you ask? Well, in some languages, there are single
characters that represent two 'sounds'. For example the 'dz'. That's
one character in some languages. This is very similar to the Germans
who have enshrined the much used ß ligature into a unique character.
However, unlike the ß which has pretty much lost all resemblance to
the original characters, and which can never appear at the beginning of
a word, these characters are mostly just a kerned version of the
original, and CAN appear at the beginning of the word. In Unicode
speak these are called digraphs. These have three and not two
capitalization forms: all lowercase, all uppercase, and the first part
of the digraph uppercased, the second part lowercased. This is what
you'd use when you want a word with just the first letter capitalized,
and you use the all-caps version only if you need an all-caps rendering
of the word (i.e. rarely). This obscure little factoid actually was
useful for me when writing Lombok: The method that turns "foo" into
"getFoo" will use toTitleCase() and not .toUpperCase().
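
The three forms can be seen with the Unicode 'dz' digraph: \u01f3 is
the lowercase form, \u01f2 the title case form (capital D, small z)
and \u01f1 the all-caps form. The getterName method below is only a
sketch of the idea described above, written for this example; it is not
Lombok's actual code, and the class name TitleCaseDemo is likewise
hypothetical.

```java
public class TitleCaseDemo {
    // Builds a getter name by title-casing (not upper-casing) the first
    // character of the field name.
    static String getterName(String field) {
        if (field.isEmpty()) {
            return "get";
        }
        return "get" + Character.toTitleCase(field.charAt(0)) + field.substring(1);
    }

    public static void main(String[] args) {
        char dz = '\u01f3'; // lowercase dz digraph

        // Title case and upper case are different characters here.
        System.out.println(Character.toTitleCase(dz) == '\u01f2'); // true
        System.out.println(Character.toUpperCase(dz) == '\u01f1'); // true

        // For ordinary letters the two agree, so plain names are unaffected.
        System.out.println(getterName("foo")); // getFoo
    }
}
```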

And there ends our trip round the world. I hope you enjoyed it. Though
now you know: That feeling of despair when someone mentions i18n?
You're entirely correct in feeling it. It's a pain in the tusch!

-- 
You received this message because you are subscribed to the Google Groups "The 
Java Posse" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/javaposse?hl=en.
