Re: [The Java Posse] Dick, that's not how you compare strings!

A McDowell Fri, 06 Aug 2010 01:17:56 -0700

> Am I wrong in considering toUpperCase(Locale) as a more suitable
> alternative?


It depends on the nature of your data and how you want your code to handle
normalization.

This can return false, even though both literals render identically:
"É".equals("É");

This is because you can make a Latin Capital Letter E With Acute using the
single code point U+00C9 or the combining sequence U+0045 U+0301.

So, the comparison can be rewritten as: "\u00C9".equals("\u0045\u0301");

Have a look at java.text.Normalizer to convert between these forms.
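
As a rough sketch of what that could look like (the class and method names
here are made up for illustration; only java.text.Normalizer is the real
API), you could normalize both sides to NFC before comparing:

```java
import java.text.Normalizer;

final class NormalizedEquals {
    // Hypothetical helper: compares two strings after converting both
    // to NFC (canonical decomposition followed by canonical composition).
    static boolean equalsNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00C9";      // single code point
        String combining = "\u0045\u0301";  // E plus combining acute accent
        System.out.println(precomposed.equals(combining));      // false
        System.out.println(equalsNfc(precomposed, combining));  // true
    }
}
```

Whether NFC or NFD is the better target depends on your data; the point is
only that both sides go through the same form before you compare them.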

You'll start scratching your head when you get to characters like mu (U+03BC)
and micro (U+00B5). They might share a glyph and look identical to a user, but
one is a Greek letter and the other is a mathematical symbol and they have
separate Unicode code points.
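
To make the mu/micro case concrete, here is a small sketch (again, the class
and helper names are my own invention). Canonical normalization keeps the two
apart, while compatibility normalization (NFKC) folds the micro sign into the
Greek letter. Which behaviour you want is a decision about your data, not
something the library can make for you:

```java
import java.text.Normalizer;
import java.util.Locale;

final class MuVersusMicro {
    static final String MICRO_SIGN = "\u00B5"; // MICRO SIGN
    static final String GREEK_MU = "\u03BC";   // GREEK SMALL LETTER MU

    // Canonical normalization (NFC) keeps the two code points distinct.
    static boolean equalUnderNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Compatibility normalization (NFKC) maps the micro sign to Greek mu.
    static boolean equalUnderNfkc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKC));
    }

    public static void main(String[] args) {
        System.out.println(MICRO_SIGN.equals(GREEK_MU));           // false
        System.out.println(equalUnderNfc(MICRO_SIGN, GREEK_MU));   // false
        System.out.println(equalUnderNfkc(MICRO_SIGN, GREEK_MU));  // true
        // Upper-casing either one yields GREEK CAPITAL LETTER MU (U+039C).
        System.out.println(MICRO_SIGN.toUpperCase(Locale.ROOT)
                .equals(GREEK_MU.toUpperCase(Locale.ROOT)));       // true
    }
}
```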

Another thing to consider is that Unicode is a work in progress (and is
likely to remain so for the foreseeable future). Even if all code points have
a consistent transformation to upper case now, there's no guarantee that
this will remain the case. By using the more specialized libraries, you can
be reasonably certain that they will be maintained along with the
development of the Unicode spec.
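
By "more specialized libraries" I mean things like java.text.Collator. A
minimal sketch, assuming English collation rules and that you want case
differences ignored but accents kept (the class and method names are
hypothetical; the Collator calls are the standard ones):

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorCompare {
    // Hypothetical helper: true when the collator for the given locale
    // treats the two strings as equal, ignoring case differences.
    static boolean equalIgnoringCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // SECONDARY strength ignores case (a tertiary difference)
        // but still distinguishes accents.
        collator.setStrength(Collator.SECONDARY);
        // Treat canonically equivalent sequences (precomposed versus
        // combining) as the same text.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale locale = Locale.ENGLISH; // assumption: pick the locale of your data
        System.out.println(equalIgnoringCase("joe", "JOE", locale));
        System.out.println(equalIgnoringCase("\u00C9", "\u0045\u0301", locale));
    }
}
```

The strength and decomposition settings are the knobs to experiment with;
the right values depend on the languages you need to support.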

How likely any of this is going to be a practical problem depends on the
source of your data and which languages you want to support.


On Fri, Aug 6, 2010 at 12:48 AM, Alexey Zinger <[email protected]> wrote:

> Very interesting read, but I gotta ask about toUpperCase(Locale).  You've
> soundly debunked the idea of using toLowerCase(Locale) as a reliable
> internationalizable way of comparing strings irrespective of
> capitalization.  Am I wrong in considering toUpperCase(Locale) as a more
> suitable alternative?  It seems your examples would lead one to believe so.
>
> Alexey
>
>
> ------------------------------
> *From:* Reinier Zwitserloot <[email protected]>
> *To:* The Java Posse <[email protected]>
> *Sent:* Wed, August 4, 2010 10:07:54 AM
> *Subject:* [The Java Posse] Dick, that's not how you compare strings!
>
> This is a long post that is mostly a fun walk through a few different
> languages used on planet earth, showing why the entire concept of
> 'upper' and 'lower' case is mostly broken if one subscribes to the
> theory that English isn't the only language around. The summary, in
> case you don't want to read it all, boils down to this: (A) Use
> String.CASE_INSENSITIVE_ORDER, it's better than toLowerCase()ing both
> arguments, and (B) actually, don't use that either, use
> java.text.Collator.
>
> But, if you have a casual interest in learning something, by all means
> read on!
>
> From one of the many Scala threads, Dick uses .toLowerCase() as a
> placeholder for case insensitive string comparison, arguing that for
> example this:
>
> list.sortBy(_.toLowerCase)
>
> is much simpler than the Java alternative, which in Dick's example is
> e.g. an anonymous inner class supplying an implementation of
> Comparator.
>
> That's a flawed example. The right way to do this in Java would be:
>
> Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
>
> but it gets much worse, because this just isn't how one should ever
> compare strings. I'll give you three examples:
>
> EXAMPLE 1: The Eszett character.
>
> In German, a certain ligature for two consecutive s characters has
> been used so much for so long that it's considered its own unique
> character: the Eszett or 'scharfes S', ß. It shows up all over the
> place in German, for example in the word for street, which is straße
> in German. The Eszett is somewhat unique in that it has no capital
> form. Words don't start with double s so there has never been a
> reason to capitalize it. Instead, if you ask a German to write street
> in all caps, they'd write "STRASSE".
>
> Uh-oh. That's not a reversible operation! The lowercase form of
> "STRASSE" is "strasse", and even knowing that the locale is German,
> you can't just guess that double-s *ALWAYS* becomes the ß character.
> You'd need a dictionary. Arguably you're better off doing that
> conversion than not doing it if you know it's German (more likely),
> but it won't be a sure thing. Nevertheless, checking whether "straße"
> and "STRASSE" are equal under case insensitive comparison SHOULD
> return true, not false. That's achievable, but using .toLowerCase()
> as a proxy for case insensitive comparison is never going to get us
> there.
>
> Java screws this up, though. Java has chosen to adopt the Unicode
> invention of the capital Eszett, which nobody in Germany uses and is
> clearly a silly idea: You don't hack a living, breathing language to
> get around i18n problems. If that's feasible we might as well tell the
> world to stop whining and only use English when using machines. Even
> more dire, String.CASE_INSENSITIVE_ORDER does not in fact return
> '0' (indicating equality) when comparing straße and STRASSE, even
> though every single language that uses the ß (German. That's it!) says
> those two are case-insensitive equal. Nevertheless, the fact that Java
> has a bug here does not excuse the brokenness of the 'toLowerCase() is
> an alternative for case insensitive comparison' meme.
>
> EXAMPLE 2: The Turkish dotless/dotted i and the somewhat famous "PHP
> is completely broken on Turkish machines" problem.
>
> The ß vs. SS issue indicates that toLowerCase() isn't a valid
> replacement for case insensitive comparison, but it gets worse. You
> can't, in fact, say anything whatsoever about casing, or case
> insensitive equality, without knowing the locale you're operating in.
> In Turkish there's not one 'i' but two: The dotted i and the dotless
> i. Where this gets crazy is the capital forms of those. The capital
> form of 'i' is a _dotted_ capital I (İ), and the capital form of
> dotless 'i' (ı) is our normal, familiar capital I, which doesn't have
> a dot in it. This even messes with typesetting (fi is a common
> ligature where the tip of the curvy top of the f is joined with the
> i's dot. That's not appropriate in Turkish, where that dot is
> important). This means "i" and "I" are not equal in case-insensitive
> comparison in the Turkish locale. .toLowerCase() comparison gets this
> wrong, of course. toLowerCase(new Locale("tr")) would actually get
> this right. String.CASE_INSENSITIVE_ORDER gets this wrong, because
> you can't give it a locale. For natural language comparison (and why
> are you upper/lowering strings in the first place, if not because
> you're doing natural language comparison?), tLC(), tUC(), and
> String.CASE_INSENSITIVE_ORDER are all inherently broken. Java gets
> this a little right and offers Locale-based variants of tLC() and
> tUC(), as I hinted at earlier.
>
> That PHP we always like to bash on? One of the things PHP does (or
> perhaps did, I don't keep up with it) wrong for the longest of times
> was that it would completely lose its marbles if the system's locale
> was set to Turkish. It would just fail to work. PHP identifiers are
> defined as case insensitive, and PHP implemented this by
> toLowerCase()ing everything, using the platform default locale. This
> turns "FILE" into "fıle" which is not equal to "file", and thus
> running most PHP code on a machine configured with a Turkish locale
> breaks, if the PHP is in English. This is a famous example of the
> universally lamented "it works for me" attitude. So, yes, making this
> mistake has dire consequences.
>
> EXAMPLE 3: ASCII hacks.
>
> To make matters worse still, let's think about why one would want to
> do a case insensitive comparison in the first place. Presumably
> because there's some user input of some sort that needs to be
> compared, and you don't want to bother the user with case
> sensitivity. However, if that's the aim, "case insensitive" is
> entirely the wrong idea. There are a bajillion systems around that
> can't deal with Unicode. For example, if many 'name' forms don't even
> accept a dash in your last name, how many do you think accept a "ü"
> in it? And yet plenty of German folks are called "Müller". The 'fix'
> these poor saps have used for years and years is to write the
> canonical ASCII alternative for their funky character. ß becomes ss,
> ü becomes ue, etcetera. I'm betting that if you intend for "JOE" and
> "joe" to be equal, then you should consider "müller" and "mueller"
> equal too.
>
> What's really needed is a comparator for human-entered strings.
> Ideally such a tool would first canonicalize each string, for example
> turning dotless i of any capitalization into a dotted lowercase i,
> ß into ss, ü into ue, etcetera, and only then compare the strings.
> I'm not even sure this can be done properly without knowing a
> locale, but you could do a lot better than
> String.CASE_INSENSITIVE_ORDER that way, and far better still than
> comparing the .toLowerCase() versions of any two given strings.
>
> Java does in fact have something for this: java.text.Collator. This
> indeed does more or less what I just described, and it is in fact what
> one ought to be using. If I were you, I'd make a PMD plugin right now
> that checks for usage of locale-less tLC and tUC, as well as
> String.CASE_INSENSITIVE_ORDER, and flags them all as warnings.
>
> BONUS EXAMPLE: Lower case. Upper case. And Title case??.
>
> Astute folks may have observed that Character.toTitleCase() exists.
> What's that you ask? Well, in some languages, there are single
> characters that represent two 'sounds'. For example the 'dz'. That's
> one character in some languages. This is very similar to the Germans
> who have enshrined the much used ß ligature into a unique character.
> However, unlike the ß which has pretty much lost all resemblance to
> the original character, and which can never appear at the beginning of
> a word, these characters are mostly just a kerned version of the
> original, and CAN appear at the beginning of the word. In Unicode
> speak these are called digraphs. These have 3 and not 2 capitalization
> forms: all lowercase, all uppercase, and the first part of the digraph
> uppercased, the second part lowercased. This is what you'd use when
> you want a word with just the first letter capitalized, and you use
> the all-caps version only if you need an all-caps rendering of the
> word (i.e. rarely). This obscure little factoid actually was useful
> for me when writing lombok: The method that turns "foo" into "getFoo"
> will use toTitleCase() and not .toUpperCase().
>
> And there ends our trip round the world. I hope you enjoyed it. Though
> now you know: That feeling of despair when someone mentions i18n?
> You're entirely correct in feeling it. It's a pain in the tusch!
>
> --
> You received this message because you are subscribed to the Google Groups
> "The Java Posse" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to javaposse+
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/javaposse?hl=en.

