[The Java Posse] Re: Dick, that's not how you compare strings!

Ben Schulz Sat, 07 Aug 2010 22:06:10 -0700

I think the previously mentioned combining sequences would disqualify
both toLowerCase and toUpperCase (as well as a &&-reduce of a per-
char-|| of both). In any case, really interesting thread, thanks for
pointing it out.


With kind regards
Ben

On 8 Aug., 00:27, Reinier Zwitserloot <[email protected]> wrote:
> No; it's true that in the particular examples I
> showed .toUpperCase(Locale) would have worked slightly better, but
> these were just a small selection from a gigantic pile of weirdness.
> Interestingly enough the only other case I can think of offhand is the
> greek sigma which has 2 lowercase forms, though there
> again .toUpperCase(Locale) would work a bit better. Still, I'll eat my
> shoes if there's just no possible way .toUpperCase() can ever go
> wrong.
>
> java.text.Collator is it.
>
> On Aug 6, 1:48 am, Alexey Zinger <[email protected]> wrote:
>
> > Very interesting read, but I gotta ask about toUpperCase(Locale).  You've
> > soundly debunked the idea of using toLowerCase(Locale) as a reliable
> > internationalizable way of comparing strings irrespective of
> > capitalization.  Am I wrong in considering toUpperCase(Locale) as a more
> > suitable alternative?  It seems your examples would lead one to believe so.
>
> >  Alexey
>
> > ________________________________
> > From: Reinier Zwitserloot <[email protected]>
> > To: The Java Posse <[email protected]>
> > Sent: Wed, August 4, 2010 10:07:54 AM
> > Subject: [The Java Posse] Dick, that's not how you compare strings!
>
> > This is a long post that is mostly a fun walk through a few different
> > languages used on planet earth, showing why the entire concept of
> > 'upper' and 'lower' case is mostly broken if one subscribes to the
> > theory that english isn't the only language around. The summary, in
> > case you don't want to read it all, boils down to this: (A) Use
> > String.CASE_INSENSITIVE_ORDER, it's better than toLowerCase()ing both
> > arguments, and (B) actually, don't use that either, use
> > java.text.Collator.
>
> > But, if you have a casual interest in learning something, by all means
> > read on!
>
> > From one of the many scala threads, dick uses .toLowerCase() as a
> > placeholder for case insensitive string comparison, arguing that for
> > example this:
>
> > list.sort(_.toLowerCase);
>
> > is much simpler than the java alternative, which in Dick's example is
> > e.g. an Anonymous inner class literal to supply an implementation of
> > Comparator.
>
> > That's a flawed example. The right way to do this in java would be:
>
> > Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
>
> > but it gets much worse, because this just isn't how one should ever
> > compare strings. I'll give you three examples:
>
> > EXAMPLE 1: The Eszett character.
>
> > In german, a certain ligature for two consecutive s characters has
> > been used so much for so long it's considered its own unique
> > character. The Eszett or 'Scharfes s': ß. It shows up all over the
> > place in german, for example in the word street, which is straße in
> > german. The Eszett is somewhat unique in that it has no capital form.
> > Words don't start with double s so there has never been a reason to
> > capitalize it. Instead, if you ask a german to write street in all
> > caps, they'd write "STRASSE".
>
> > Uhoh. That's not a reversible operation! The lowercase form of
> > "STRASSE" is "strasse", and even knowing that the locale is german,
> > you can't just guess that double-s *ALWAYS* becomes the ß character. You'd
> > need a dictionary. Arguably you're better off doing that conversion
> > than not doing it if you know it's german (more likely), but it won't
> > be a sure thing. Nevertheless, trying to check if "straße" and
> > "STRASSE" are two equal strings under case insensitive comparison
> > SHOULD return true, and not false, but while that's in the cards,
> > using .toLowerCase() as a proxy for case insensitive comparison is
> > never going to get us that.
>
> > Java screws this up, though. Java has chosen to adopt the unicode
> > invention of the capital Eszett, which nobody in germany uses and is
> > clearly a silly idea: You don't hack a living, breathing language to
> > get around i18n problems. If that's feasible we might as well tell the
> > world to stop whining and only use english when using machines. Even
> > more dire, String.CASE_INSENSITIVE_ORDER does not in fact return
> > '0' (indicating equality) when comparing straße and STRASSE, even
> > though every single language that uses the ß (german. That's it!) says
> > those two are case-insensitive equal. Nevertheless, the fact that java
> > has a bug here does not excuse the brokenness of the 'toLowerCase() is
> > an alternative for case insensitive comparison' meme.
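The irreversibility of the ß round trip can be checked with a few lines of Java. This is only a minimal sketch: the class and method names (EszettDemo, upper, lower, naiveEquals) are made up for illustration, and the comments describe what the JDK's standard case mappings are expected to do rather than anything guaranteed by the thread.

```java
import java.util.Locale;

public class EszettDemo {
    // Hypothetical helpers; Locale.GERMAN is used so the default locale does not matter.
    static String upper(String s) { return s.toUpperCase(Locale.GERMAN); }
    static String lower(String s) { return s.toLowerCase(Locale.GERMAN); }

    // The "lowercase both sides, then equals" idiom criticised in the post.
    static boolean naiveEquals(String a, String b) {
        return lower(a).equals(lower(b));
    }

    public static void main(String[] args) {
        String street = "stra\u00DFe"; // the German word for street, with the sharp s

        // Upper-casing expands the sharp s to two letters...
        System.out.println(upper(street));
        // ...and lower-casing the result cannot bring it back.
        System.out.println(lower(upper(street)));

        // Both shortcuts treat the two spellings as different strings.
        System.out.println(naiveEquals(street, "STRASSE"));
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare(street, "STRASSE") == 0);
    }
}
```

Comparing the upper-cased forms happens to work for this one word, which is why toUpperCase(Locale) looks better in this particular example; it is not a general fix.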
>
> > EXAMPLE 2: The turkish dotless/dotted i and the somewhat famous: PHP
> > is completely broken on Turkish machines problem.
>
> > The ß vs. SS issue indicates that toLowerCase() isn't a valid
> > replacement for case insensitive comparison, but it gets worse. You
> > can't, in fact, say anything whatsoever about casing, or case
> > insensitive equality, without knowing the locale you're operating in.
> > In turkish there's not one 'i' but two: The dotted i and the dotless
> > i. Where this gets crazy is the capital forms of those. The capital
> > form of 'i' is a _dotted_ capital I, and the capital form of dotless
> > 'i' is our normal, familiar capital I, which doesn't have a dot in it.
> > This even messes with typography (fi is a common ligature where the tip
> > of the curvy top of the f is joined with the i's dot. That's not
> > appropriate in turkish, where that dot is important). This means "i"
> > and "I" are not equal in case-insensitive comparison in the Turkish
> > locale. .toLowerCase() comparison gets this wrong, of course.
> > toLowerCase(new Locale("tr")) would actually get this right.
> > String.CASE_INSENSITIVE_ORDER gets this wrong, because you can't give
> > it a locale. For natural language comparison (and why are you upper/
> > lowering strings in the first place, if not because you're doing
> > natural language comparison?), tLC(), tUC(),
> > String.CASE_INSENSITIVE_ORDER - are all inherently broken. Java gets
> > this a little right and offers Locale-based variants of tLC() and
> > tUC(), as I hinted at earlier.
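A small sketch of the locale dependence described above. The class name TurkishIDemo and its helper are hypothetical; Locale.forLanguageTag("tr") is used here instead of the constructor purely to avoid a deprecation warning on recent JDKs, and Locale.ROOT stands in for "no particular language".

```java
import java.util.Locale;

public class TurkishIDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Lowercase-then-compare, but with an explicit locale instead of the platform default.
    static boolean equalsIgnoringCase(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        // In the Turkish locale, capital I lowercases to dotless i (U+0131)...
        System.out.println("I".toLowerCase(TURKISH).equals("\u0131"));
        // ...and lowercase i uppercases to dotted capital I (U+0130).
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130"));

        // The same pair of strings is "equal" or not depending on the locale.
        System.out.println(equalsIgnoringCase("I", "i", Locale.ROOT));
        System.out.println(equalsIgnoringCase("I", "i", TURKISH));
    }
}
```

The locale-less toLowerCase() behaves like one of these two calls depending on how the machine is configured, which is exactly why it is unreliable.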
>
> > That PHP we always like to bash on? One of the things PHP does (or
> > perhaps did, I don't keep up with it) wrong for the longest of times
> > was that it would completely lose its marbles if the system's locale
> > was set to turkish. It would just fail to work. PHP identifiers are
> > defined as case insensitive, and PHP implemented this by
> > toLowerCase()ing everything, using the platform default locale. This
> > turns "FILE" into "fıle" which is not equal to "file", and thus
> > running most PHP code on a machine configured with turkish locale
> > breaks, if the PHP is in english. This is a famous example of the
> > universally lamented "it works for me" attitude. So, yes, making this
> > mistake has dire consequences.
>
> > EXAMPLE 3: ascii hacks.
>
> > To make matters worse still, let's think about why one would want to do
> > a case insensitive comparison in the first place. Presumably because
> > there's some user input of some sort that needs to be compared, and
> > you don't want to bother the user with case sensitivity. However, if
> > that's the aim, "case insensitive" is entirely the wrong idea. There
> > are a bajillion systems around that can't deal with unicode. For
> > example, if many 'name' forms don't even accept a dash in your last
> > name, how many do you think accept a "ü" in it? And yet plenty of
> > german folks are called "Müller". The 'fix' these poor saps have used
> > for years and years is to write the canonical ascii alternative for
> > their funky character. ß becomes ss, ü becomes ue, etcetera. I'm
> > betting that if you intend for "JOE" and "joe" to be equal, then you
> > should consider "müller" and "mueller" equal too.
>
> > What's really needed is a human inputted string comparator. Optimally
> > speaking such a tool will first canonicalize each string, turning for
> > example dotless i of any capitalization into a dotted lowercase i,
> > turning ß into ss, ü into ue, etcetera, and only then compare the
> > strings. I'm not even sure this can be done properly without knowing a
> > locale, but you could do a lot better than
> > String.CASE_INSENSITIVE_ORDER that way, and far better still versus
> > comparing the .toLowerCase() versions of any two given strings.
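To make the "canonicalize first, then compare" idea concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: the class HumanInputFold, the method names, and above all the replacement table, which only covers the handful of German and Turkish characters mentioned in this post. A real implementation would need locale-specific tables (ü becoming ue is a German convention, not a universal one).

```java
import java.util.Locale;

public class HumanInputFold {
    // Hypothetical canonical form: ASCII-ish, lowercase, German-style transliteration.
    static String canonicalize(String s) {
        // Fold both Turkish i variants to a plain dotted i before lowercasing.
        String t = s.replace('\u0130', 'i').replace('\u0131', 'i');
        t = t.toLowerCase(Locale.ROOT);
        return t.replace("\u00DF", "ss")   // sharp s
                .replace("\u00FC", "ue")   // u umlaut
                .replace("\u00F6", "oe")   // o umlaut
                .replace("\u00E4", "ae");  // a umlaut
    }

    static boolean sameForHumans(String a, String b) {
        return canonicalize(a).equals(canonicalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameForHumans("JOE", "joe"));
        System.out.println(sameForHumans("M\u00DCLLER", "mueller"));
        System.out.println(sameForHumans("stra\u00DFe", "STRASSE"));
    }
}
```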
>
> > Java does in fact have something for this: java.text.Collator. This
> > indeed does more or less what I just described, and it is in fact what
> > one ought to be using. If I were you, I'd make a pmd plugin right now
> > that checks for usage of locale-less tLC and tUC, as well as
> > String.CASE_INSENSITIVE_ORDER, and flag em all as warnings.
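For reference, a minimal sketch of what using java.text.Collator for this looks like. The class CollatorDemo and its two helpers are invented for the example; the choice of Collator.PRIMARY strength (ignore case and accent differences, compare base letters only) is an assumption about what "case insensitive" should mean here, and how individual characters such as ß are ordered depends on the locale's collation rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorDemo {
    // A locale-aware comparator that looks only at base letters.
    static Collator caseInsensitive(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sorted(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(caseInsensitive(locale));
        return copy;
    }

    public static void main(String[] args) {
        Collator german = caseInsensitive(Locale.GERMAN);
        System.out.println(german.compare("JOE", "joe") == 0);
        // Worth trying on your own JDK: how does it rank these two?
        System.out.println(german.compare("stra\u00DFe", "STRASSE"));
        System.out.println(sorted(Arrays.asList("banana", "Apple", "cherry"), Locale.ENGLISH));
    }
}
```

A Collator is a Comparator, so it drops into Collections.sort(list, collator) the same way String.CASE_INSENSITIVE_ORDER does. It will not by itself treat transliterations like "mueller" as equal to "müller"; that still needs a folding step.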
>
> > BONUS EXAMPLE: Lower case. Upper case. And Title case??.
>
> > Astute folks may have observed that Character.toTitleCase() exists.
> > What's that you ask? Well, in some languages, there are single
> > characters that represent two 'sounds'. For example the 'dz'. That's
> > one character in some languages. This is very similar to the germans
> > who have enshrined the much used ß ligature into a unique character.
> > However, unlike the ß which has pretty much lost all resemblance to
> > the original character, and which can never appear at the beginning of
> > a word, these characters are mostly just a kerned version of the
> > original, and CAN appear at the beginning of the word. In unicode
> > speak these are called digraphs. These have 3 and not 2 capitalization
> > forms: all lowercase, all uppercase, and the first part of the digraph
> > uppercased, the second part lowercased. This is what you'd use when
> > you want a word with just the first letter capitalized, and you use
> > the all-caps version only if you need an all-caps rendering of the
> > word (i.e. rarely). This obscure little factoid actually was useful
> > for me when writing lombok: The method that turns "foo" into "getFoo"
> > will use toTitleCase() and not .toUpperCase().
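The three case forms can be seen with the dž digraph (U+01C4 uppercase, U+01C5 titlecase, U+01C6 lowercase). The getter-name helper below is a hypothetical sketch of the idea described above, not lombok's actual code.

```java
public class TitleCaseDemo {
    // Builds a getter name by title-casing (not upper-casing) the first character.
    static String getterName(String field) {
        if (field.isEmpty()) {
            return "get";
        }
        return "get" + Character.toTitleCase(field.charAt(0)) + field.substring(1);
    }

    public static void main(String[] args) {
        char lowerDz = '\u01C6'; // lowercase dz-with-caron digraph

        // For ordinary letters, title case and upper case are the same thing.
        System.out.println(getterName("foo"));

        // For the digraph they differ: title case capitalises only the first half.
        System.out.println(Character.toTitleCase(lowerDz) == '\u01C5');
        System.out.println(Character.toUpperCase(lowerDz) == '\u01C4');
    }
}
```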
>
> > And there ends our trip round the world. I hope you enjoyed it. Though
> > now you know: That feeling of despair when someone mentions i18n?
> > You're entirely correct in feeling it. It's a pain in the tush!
>

