[The Java Posse] Re: Dick, that's not how you compare strings!

Reinier Zwitserloot Wed, 04 Aug 2010 17:34:06 -0700

Whoops, my bad. "straße" to uppercase in java *IS* the correct
"STRASSE", even without using Locale.GERMANY. I screwed up my encoding
(d'oh)!


Of course, "STRASSE" to lowercase remains "strasse" regardless of
locale, and "STRASSE" and "straße" aren't equal according to
String.CASE_INSENSITIVE_ORDER, so, only a small win. Still. If you
want to experiment, \u00DF is safer to play with than actually hitting
alt+S (mac only).
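
For anyone who wants to check this without fighting their editor's encoding, here is a minimal sketch. The class and method names are made up for illustration, and it passes Locale.ROOT explicitly so the result doesn't depend on the machine's default locale:

```java
import java.util.Locale;

class EszettDemo {
    // The escape for the Eszett sidesteps source-file encoding problems.
    static final String STREET = "stra\u00DFe";

    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(upper(STREET));    // STRASSE
        System.out.println(lower("STRASSE")); // strasse

        // CASE_INSENSITIVE_ORDER folds case one char at a time, so it never
        // sees the one-to-two mapping and reports the strings as different.
        int cmp = String.CASE_INSENSITIVE_ORDER.compare(STREET, "STRASSE");
        System.out.println(cmp != 0);         // true
    }
}
```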

For those who care, some extra commentary on reddit, where someone
figured this one out: 
http://www.reddit.com/r/programming/comments/cxb8h/case_insensitive_comparison_a_lot_more/


On Aug 4, 4:07 pm, Reinier Zwitserloot <[email protected]> wrote:
> This is a long post that is mostly a fun walk through a few different
> languages used on planet earth, showing why the entire concept of
> 'upper' and 'lower' case is mostly broken if one subscribes to the
> theory that english isn't the only language around. The summary, in
> case you don't want to read it all, boils down to this: (A) Use
> String.CASE_INSENSITIVE_ORDER, it's better than toLowerCase()ing both
> arguments, and (B) actually, don't use that either, use
> java.text.Collator.
>
> But, if you have a casual interest in learning something, by all means
> read on!
>
> From one of the many scala threads, dick uses .toLowerCase() as a
> placeholder for case insensitive string comparison, arguing that for
> example this:
>
> list.sort(_.toLowerCase);
>
> is much simpler than the java alternative, which in Dick's example is
> e.g. an Anonymous inner class literal to supply an implementation of
> Comparator.
>
> That's a flawed example. The right way to do this in java would be:
>
> Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
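
Spelled out as something you can compile, that one-liner looks roughly like the sketch below. SortDemo and sortedIgnoringCase are names invented for this example; the copy is only there so the caller's list isn't mutated:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortDemo {
    // Returns a sorted copy; no anonymous Comparator class required.
    static List<String> sortedIgnoringCase(List<String> input) {
        List<String> copy = new ArrayList<String>(input);
        Collections.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> fruit = Arrays.asList("pear", "Apple", "banana");
        System.out.println(sortedIgnoringCase(fruit)); // [Apple, banana, pear]
    }
}
```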
>
> but it gets much worse, because this just isn't how one should ever
> compare strings. I'll give you three examples:
>
>  EXAMPLE 1: The Eszett character.
>
> In german, a certain ligature for two consecutive s characters has
> been used so much for so long it's considered its own unique
> character. The Eszett or 'Scharfes s': ß. It shows up all over the
> place in german, for example in the word street, which is straße in
> german. The Eszett is somewhat unique in that it has no capital form.
> Words don't start with double s so there has never been a reason to
> capitalize it. Instead, if you ask a german to write street in all
> caps, they'd write "STRASSE".
>
> Uhoh. That's not a reversible operation! The lowercase form of
> "STRASSE" is "strasse", and even knowing that the locale is german,
> you can't just guess that double-s *ALWAYS* becomes ß character. You'd
> need a dictionary. Arguably you're better off doing that conversion
> than not doing it if you know it's german (more likely), but it won't
> be a sure thing. Nevertheless, trying to check if "straße" and
> "STRASSE" are two equal strings under case insensitive comparison
> SHOULD return true, and not false, but while that's in the cards,
> using .toLowerCase() as a proxy for case insensitive comparison is
> never going to get us that.
>
> Java screws this up, though. Java has chosen to adopt the unicode
> invention of the capital Eszett, which nobody in germany uses and is
> clearly a silly idea: You don't hack a living, breathing language to
> get around i18n problems. If that's feasible we might as well tell the
> world to stop whining and only use english when using machines. Even
> more dire, String.CASE_INSENSITIVE_ORDER does not in fact return
> '0' (indicating equality) when comparing straße and STRASSE, even
> though every single language that uses the ß (german. That's it!) says
> those two are case-insensitive equal. Nevertheless, the fact that java
> has a bug here does not excuse the brokenness of the 'toLowerCase() is
> an alternative for case insensitive comparison' meme.
>
>  EXAMPLE 2: The turkish dotless/dotted i and the somewhat famous
> "PHP is completely broken on Turkish machines" problem.
>
> The ß vs. SS issue indicates that toLowerCase() isn't a valid
> replacement for case insensitive comparison, but it gets worse. You
> can't, in fact, say anything whatsoever about casing, or case
> insensitive equality, without knowing the locale you're operating in.
> In turkish there's not one 'i' but two: The dotted i and the dotless
> i. Where this gets crazy is the capital forms of those. The capital
> form of 'i' is a _dotted_ capital I, and the capital form of dotless
> 'i' is our normal, familiar capital I, which doesn't have a dot in it.
> This even messes with typography (fi is a common ligature where the
> tip of the curvy top of the f is joined with the i's dot. That's not
> appropriate in turkish, where that dot is important). This means "i"
> and "I" are not equal in case-insensitive comparison in the Turkish
> locale. .toLowerCase() comparison gets this wrong, of course.
> toLowerCase(new Locale("TR")) would actually get this right.
> String.CASE_INSENSITIVE_ORDER gets this wrong, because you can't give
> it a locale. For natural language comparison (and why are you upper/
> lowering strings in the first place, if not because you're doing
> natural language comparison?), tLC(), tUC(),
> String.CASE_INSENSITIVE_ORDER - are all inherently broken. Java gets
> this a little right and offers Locale-based variants of tLC() and
> tUC(), as I hinted at earlier.
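
A small sketch of the Turkish mappings, for the curious. The class and helper names are invented for this example, and it builds the locale with Locale.forLanguageTag("tr") (available since Java 7) rather than the constructor mentioned above, purely to avoid deprecation warnings on newer JDKs:

```java
import java.util.Locale;

class TurkishIDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lowerTr(String s) {
        return s.toLowerCase(TURKISH);
    }

    static String upperTr(String s) {
        return s.toUpperCase(TURKISH);
    }

    public static void main(String[] args) {
        // Under Turkish rules, plain capital I lowercases to dotless i (U+0131)...
        System.out.println(lowerTr("I").equals("\u0131")); // true
        // ...and plain lowercase i uppercases to dotted capital I (U+0130).
        System.out.println(upperTr("i").equals("\u0130")); // true

        // The locale-blind comparator still calls i and I the same letter.
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("i", "I")); // 0
    }
}
```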
>
> That PHP we always like to bash on? One of the things PHP does (or
> perhaps did, I don't keep up with it) wrong for the longest of times
> was that it would completely lose its marbles if the system's locale
> was set to turkish. It would just fail to work. PHP identifiers are
> defined as case insensitive, and PHP implemented this by
> toLowerCase()ing everything, using the platform default locale. This
> turns "FILE" into "fıle" which is not equal to "file", and thus
> running most PHP code on a machine configured with turkish locale
> breaks, if the PHP is in english. This is a famous example of the
> universally lamented "it works for me" attitude. So, yes, making this
> mistake has dire consequences.
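
The same failure is easy to reproduce in Java. This is not PHP's actual implementation, just a hypothetical symbol table keyed by lowercased identifier, with the locale passed in so both outcomes can be seen side by side:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class IdentifierLookupDemo {
    // Hypothetical lookup: identifiers are stored lowercased, and the
    // incoming name is lowercased with whatever locale the caller supplies.
    static boolean resolves(String identifier, Locale locale) {
        Map<String, String> symbols = new HashMap<String, String>();
        symbols.put("file", "builtin");
        return symbols.containsKey(identifier.toLowerCase(locale));
    }

    public static void main(String[] args) {
        // Locale-neutral folding finds the symbol.
        System.out.println(resolves("FILE", Locale.ROOT));                  // true
        // Turkish folding yields a dotless i, so the lookup misses.
        System.out.println(resolves("FILE", Locale.forLanguageTag("tr")));  // false
    }
}
```

For identifiers, protocol keywords and other non-linguistic text, passing Locale.ROOT explicitly is the usual way to keep the default locale out of the picture.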
>
> EXAMPLE 3: ascii hacks.
>
> To make matters worse still, let's think about why one would want to do
> a case insensitive comparison in the first place. Presumably because
> there's some user input of some sort that needs to be compared, and
> you don't want to bother the user with case sensitivity. However, if
> that's the aim, "case insensitive" is entirely the wrong idea. There
> are a bajillion systems around that can't deal with unicode. For
> example, if many 'name' forms don't even accept a dash in your last
> name, how many do you think accept a "ü" in it? And yet plenty of
> german folks are called "Müller". The 'fix' these poor saps have used
> for years and years is to write the canonical ascii alternative for
> their funky character. ß becomes ss, ü becomes ue, etcetera. I'm
> betting that if you intend for "JOE" and "joe" to be equal, then you
> should consider "müller" and "mueller" equal too.
>
> What's really needed is a comparator for human-entered strings.
> Ideally such a tool will first canonicalize each string, turning for
> example dotless i of any capitalization into a dotted lowercase i,
> ß into ss, ü into ue, etcetera, and only then compare the
> strings. I'm not even sure this can be done properly without knowing a
> locale, but you could do a lot better than
> String.CASE_INSENSITIVE_ORDER that way, and far better still versus
> comparing the .toLowerCase() versions of any two given strings.
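
A rough sketch of such a canonicalizer, under loud assumptions: HumanInputKey and canonicalize are hypothetical names, the mapping table is a handful of German-style ASCII fallbacks plus the two Turkish i variants, and it is nowhere near complete. Applying ü -> ue blindly is also wrong for languages other than German, which is exactly the locale problem mentioned above.

```java
import java.util.Locale;

class HumanInputKey {
    // Illustrative only: folds a few characters to their common ASCII
    // fallbacks so that human-typed variants compare equal.
    static String canonicalize(String input) {
        // Handle both Turkish i variants before lowercasing, because
        // lowercasing U+0130 would otherwise leave a combining dot behind.
        String s = input.replace('\u0130', 'i')
                        .replace('\u0131', 'i')
                        .toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(s.length() + 4);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00DF': out.append("ss"); break; // Eszett
                case '\u00FC': out.append("ue"); break; // u with umlaut
                case '\u00F6': out.append("oe"); break; // o with umlaut
                case '\u00E4': out.append("ae"); break; // a with umlaut
                default:       out.append(c);
            }
        }
        return out.toString();
    }

    static boolean sameForHumans(String a, String b) {
        return canonicalize(a).equals(canonicalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameForHumans("JOE", "joe"));            // true
        System.out.println(sameForHumans("M\u00FCller", "MUELLER")); // true
        System.out.println(sameForHumans("stra\u00DFe", "STRASSE")); // true
    }
}
```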
>
> Java does in fact have something for this: java.text.Collator. This
> indeed does more or less what I just described, and it is in fact what
> one ought to be using. If I were you, I'd make a pmd plugin right now
> that checks for usage of locale-less tLC and tUC, as well as
> String.CASE_INSENSITIVE_ORDER, and flag em all as warnings.
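
A minimal sketch of Collator usage, assuming a German locale and PRIMARY strength (which ignores case and accent differences). The class name is invented; Collator.getInstance, setStrength and Collator.PRIMARY are the real java.text API. How the ß and ü pairs come out depends on the collation rules shipped with your JDK, so the sketch prints those results rather than promising them:

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    static Collator germanIgnoringCase() {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        // PRIMARY strength compares base letters only.
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = germanIgnoringCase();

        System.out.println(collator.compare("JOE", "joe") == 0); // true

        // 0 means "equal"; the outcome of these two is left for you to observe.
        System.out.println(collator.compare("stra\u00DFe", "STRASSE"));
        System.out.println(collator.compare("m\u00FCller", "mueller"));
    }
}
```

Collator implements Comparator, so an instance can be handed to Collections.sort in place of String.CASE_INSENSITIVE_ORDER.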
>
> BONUS EXAMPLE: Lower case. Upper case. And Title case??.
>
> Astute folks may have observed that Character.toTitleCase() exists.
> What's that you ask? Well, in some languages, there are single
> characters that represent two 'sounds'. For example the 'dz'. That's
> one character in some languages. This is very similar to the germans
> who have enshrined the much used ß ligature into a unique character.
> However, unlike the ß which has pretty much lost all resemblance to
> the original character, and which can never appear at the beginning of
> a word, these characters are mostly just a kerned version of the
> original, and CAN appear at the beginning of the word. In unicode
> speak these are called digraphs. These have 3 and not 2 capitalization
> forms: all lowercase, all uppercase, and the first part of the digraph
> uppercased, the second part lowercased. This is what you'd use when
> you want a word with just the first letter capitalized, and you use
> the all-caps version only if you need an all-caps rendering of the
> word (i.e. rarely). This obscure little factoid actually was useful
> for me when writing lombok: The method that turns "foo" into "getFoo"
> will use toTitleCase() and not .toUpperCase().
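
To make the three forms concrete, here is a small sketch using the dz digraph (U+01F1 uppercase, U+01F2 titlecase, U+01F3 lowercase). The getterName helper is a hypothetical illustration of the idea, not lombok's actual code:

```java
class TitleCaseDemo {
    // Builds a getter name by titlecasing only the first char of the field.
    static String getterName(String field) {
        if (field.isEmpty()) {
            return "get";
        }
        return "get" + Character.toTitleCase(field.charAt(0)) + field.substring(1);
    }

    public static void main(String[] args) {
        char lowerDz = '\u01F3';
        // Uppercase gives the all-caps digraph, titlecase the mixed one.
        System.out.println(Character.toUpperCase(lowerDz) == '\u01F1'); // true
        System.out.println(Character.toTitleCase(lowerDz) == '\u01F2'); // true

        // For plain ASCII, titlecase and uppercase agree.
        System.out.println(getterName("foo")); // getFoo
    }
}
```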
>
> And there ends our trip round the world. I hope you enjoyed it. Though
> now you know: That feeling of despair when someone mentions i18n?
> You're entirely correct in feeling it. It's a pain in the tusch!
