Very interesting read, but I gotta ask about toUpperCase(Locale). You've soundly debunked the idea of using toLowerCase(Locale) as a reliable internationalizable way of comparing strings irrespective of capitalization. Am I wrong in considering toUpperCase(Locale) as a more suitable alternative? It seems your examples would lead one to believe so.
Alexey

________________________________
From: Reinier Zwitserloot <[email protected]>
To: The Java Posse <[email protected]>
Sent: Wed, August 4, 2010 10:07:54 AM
Subject: [The Java Posse] Dick, that's not how you compare strings!

This is a long post that is mostly a fun walk through a few different languages used on planet Earth, showing why the entire concept of 'upper' and 'lower' case is mostly broken if one subscribes to the theory that English isn't the only language around. The summary, in case you don't want to read it all, boils down to this:

(A) Use String.CASE_INSENSITIVE_ORDER; it's better than toLowerCase()ing both arguments, and

(B) actually, don't use that either; use java.text.Collator.

But, if you have a casual interest in learning something, by all means read on!

In one of the many Scala threads, Dick uses .toLowerCase() as a placeholder for case-insensitive string comparison, arguing that, for example, this:

    list.sort(_.toLowerCase);

is much simpler than the Java alternative, which in Dick's example is an anonymous inner class literal supplying an implementation of Comparator. That's a flawed example. The right way to do this in Java would be:

    Collections.sort(list, String.CASE_INSENSITIVE_ORDER);

but it gets much worse, because this just isn't how one should ever compare strings. I'll give you three examples:

EXAMPLE 1: The Eszett character.

In German, a certain ligature for two consecutive s characters has been used so much for so long that it's considered its own unique character: the Eszett or 'Scharfes s', ß. It shows up all over the place in German, for example in the word for street, which is straße. The Eszett is somewhat unique in that it has no capital form. Words don't start with a double s, so there has never been a reason to capitalize it. Instead, if you ask a German to write street in all caps, they'd write "STRASSE". Uh-oh. That's not a reversible operation!
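A minimal sketch of that round trip, plus the CASE_INSENSITIVE_ORDER sort mentioned earlier. The class and method names (EszettRoundTrip, roundTrip, sortIgnoringCase) are made up for illustration, and the ß is written as the escape \u00DF so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class EszettRoundTrip {
    /** Upper-cases and then lower-cases a word using the given locale's rules. */
    static String roundTrip(String word, Locale locale) {
        return word.toUpperCase(locale).toLowerCase(locale);
    }

    /** Returns a sorted copy of the input, ordered by String.CASE_INSENSITIVE_ORDER. */
    static List<String> sortIgnoringCase(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        String street = "stra\u00DFe"; // "strasse" spelled with a sharp s

        // Upper-casing expands the sharp s into two characters.
        System.out.println(street.toUpperCase(Locale.GERMAN)); // STRASSE

        // Lower-casing again cannot know that the two s characters were one sharp s.
        System.out.println(roundTrip(street, Locale.GERMAN));  // strasse

        System.out.println(sortIgnoringCase(Arrays.asList("bob", "Alice", "carol")));
    }
}
```

The string that comes back is one character longer than the one that went in, so comparing the round-tripped form against the original with equals() fails.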
The lowercase form of "STRASSE" is "strasse", and even knowing that the locale is German, you can't just guess that a double s *ALWAYS* becomes the ß character. You'd need a dictionary. Arguably you're better off doing that conversion than not doing it if you know it's German (it's more likely to be right), but it won't be a sure thing. Nevertheless, checking whether "straße" and "STRASSE" are equal under case-insensitive comparison SHOULD return true, not false, and while that's in the cards, using .toLowerCase() as a proxy for case-insensitive comparison is never going to get us there.

Java screws this up, though. Java has chosen to adopt the Unicode invention of the capital Eszett, which nobody in Germany uses and is clearly a silly idea: you don't hack a living, breathing language to get around i18n problems. If that were feasible, we might as well tell the world to stop whining and only use English when using machines. Even more dire, String.CASE_INSENSITIVE_ORDER does not in fact return '0' (indicating equality) when comparing straße and STRASSE, even though every single language that uses the ß (German. That's it!) says those two are case-insensitive equal. Nevertheless, the fact that Java has a bug here does not excuse the brokenness of the 'toLowerCase() is an alternative for case-insensitive comparison' meme.

EXAMPLE 2: The Turkish dotless/dotted i, and the somewhat famous "PHP is completely broken on Turkish machines" problem.

The ß vs. SS issue indicates that toLowerCase() isn't a valid replacement for case-insensitive comparison, but it gets worse. You can't, in fact, say anything whatsoever about casing, or case-insensitive equality, without knowing the locale you're operating in. In Turkish there's not one 'i' but two: the dotted i and the dotless i (ı). Where this gets crazy is the capital forms of those. The capital form of 'i' is a _dotted_ capital I (İ), and the capital form of the dotless 'ı' is our normal, familiar capital I, which doesn't have a dot in it.
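Both points above (CASE_INSENSITIVE_ORDER on straße vs. STRASSE, and the Turkish capital forms) can be checked with a short sketch. The class and helper names are invented for illustration, the Turkish locale is obtained via Locale.forLanguageTag("tr"), and the non-ASCII letters are written as escapes: \u00DF is ß, \u0130 is the dotted capital İ, \u0131 is the dotless ı.

```java
import java.util.Locale;

final class TurkishCasing {
    // Assumption: the language tag "tr" selects Turkish casing rules.
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s) {
        return s.toUpperCase(TURKISH);
    }

    static String lower(String s) {
        return s.toLowerCase(TURKISH);
    }

    /** True only if String.CASE_INSENSITIVE_ORDER reports the two strings as equal. */
    static boolean sameIgnoringCase(String a, String b) {
        return String.CASE_INSENSITIVE_ORDER.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Turkish rules: dotted stays dotted, dotless stays dotless.
        System.out.println(upper("i").equals("\u0130"));  // true: i -> dotted capital I
        System.out.println(upper("\u0131").equals("I"));  // true: dotless i -> plain I
        System.out.println(lower("I").equals("\u0131"));  // true: plain I -> dotless i

        // The locale-less comparator knows nothing about any of this.
        System.out.println(sameIgnoringCase("i", "I"));                 // true
        System.out.println(sameIgnoringCase("stra\u00DFe", "STRASSE")); // false
    }
}
```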
This even messes with typesetting (fi is a common ligature, where the tip of the curvy top of the f is joined with the i's dot. That's not appropriate in Turkish, where that dot is important). This means "i" and "I" are not equal in case-insensitive comparison in the Turkish locale. A plain .toLowerCase() comparison gets this wrong, of course. toLowerCase(new Locale("TR")) would actually get this right. String.CASE_INSENSITIVE_ORDER gets this wrong, because you can't give it a locale. For natural language comparison (and why are you upper/lowering strings in the first place, if not because you're doing natural language comparison?), tLC(), tUC() and String.CASE_INSENSITIVE_ORDER are all inherently broken. Java gets this a little right and offers Locale-based variants of tLC() and tUC(), as I hinted at earlier.

That PHP we always like to bash on? One of the things PHP does (or perhaps did, I don't keep up with it) wrong for the longest of times was that it would completely lose its marbles if the system's locale was set to Turkish. It would just fail to work. PHP identifiers are defined as case insensitive, and PHP implemented this by toLowerCase()ing everything, using the platform default locale. This turns "FILE" into "fıle", which is not equal to "file", and thus running most PHP code on a machine configured with a Turkish locale breaks if the PHP code is written in English. This is a famous example of the universally lamented "it works for me" attitude. So, yes, making this mistake has dire consequences.

EXAMPLE 3: ASCII hacks.

To make matters worse still, let's think about why one would want to do a case-insensitive comparison in the first place. Presumably because there's some user input of some sort that needs to be compared, and you don't want to bother the user with case sensitivity. However, if that's the aim, "case insensitive" is entirely the wrong idea. There are a bajillion systems around that can't deal with Unicode.
For example, if many 'name' forms don't even accept a dash in your last name, how many do you think accept a "ü" in it? And yet plenty of German folks are called "Müller". The 'fix' these poor saps have used for years and years is to write the canonical ASCII alternative for their funky character: ß becomes ss, ü becomes ue, etcetera. I'm betting that if you intend for "JOE" and "joe" to be equal, then you should consider "müller" and "mueller" equal too.

What's really needed is a human-inputted string comparator. Optimally speaking, such a tool will first canonicalize each string, turning for example dotless i of any capitalization into a dotted lowercase i, ß into ss, ü into ue, etcetera, and only then compare the strings. I'm not even sure this can be done properly without knowing a locale, but you could do a lot better than String.CASE_INSENSITIVE_ORDER that way, and far better still versus comparing the .toLowerCase() versions of any two given strings.

Java does in fact have something for this: java.text.Collator. This indeed does more or less what I just described, and it is in fact what one ought to be using. If I were you, I'd make a PMD plugin right now that checks for usage of locale-less tLC and tUC, as well as String.CASE_INSENSITIVE_ORDER, and flags 'em all as warnings.

BONUS EXAMPLE: Lower case. Upper case. And Title case??

Astute folks may have observed that Character.toTitleCase() exists. What's that, you ask? Well, in some languages, there are single characters that represent two 'sounds'. For example the 'dz'. That's one character in some languages. This is very similar to the Germans, who have enshrined the much used ß ligature as a unique character. However, unlike the ß, which has pretty much lost all resemblance to the original characters and which can never appear at the beginning of a word, these characters are mostly just a kerned version of the original, and CAN appear at the beginning of the word.
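For the curious, here is a small sketch of what Character.toTitleCase() does with such a character, using the single-character 'dz' at U+01F3 (its neighbours U+01F1 and U+01F2 are the other case forms). The class name and the capitalizeFirst helper are hypothetical, purely for illustration.

```java
final class DigraphCases {
    static final char DZ_SMALL = '\u01F3'; // one character that looks like "dz"

    /** Hypothetical helper: capitalizes a word the way a heading or name would need. */
    static String capitalizeFirst(String word) {
        if (word.isEmpty()) {
            return word;
        }
        // Title case, not upper case: only the first half of a digraph is capitalized.
        return Character.toTitleCase(word.charAt(0)) + word.substring(1);
    }

    public static void main(String[] args) {
        // Print code points so the result does not depend on console fonts.
        System.out.printf("upper: U+%04X%n", (int) Character.toUpperCase(DZ_SMALL)); // upper: U+01F1
        System.out.printf("title: U+%04X%n", (int) Character.toTitleCase(DZ_SMALL)); // title: U+01F2

        // For ordinary letters, title case and upper case coincide.
        System.out.println(capitalizeFirst("foo")); // Foo
    }
}
```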
In Unicode speak these are called digraphs. These have 3 and not 2 capitalization forms: all lowercase, all uppercase, and the first part of the digraph uppercased with the second part lowercased. The last one is what you'd use when you want a word with just the first letter capitalized; you use the all-caps version only if you need an all-caps rendering of the word (i.e. rarely). This obscure little factoid was actually useful to me when writing Lombok: the method that turns "foo" into "getFoo" uses toTitleCase() and not .toUpperCase().

And there ends our trip round the world. I hope you enjoyed it. Though now you know: that feeling of despair when someone mentions i18n? You're entirely correct in feeling it. It's a pain in the tush!

--
You received this message because you are subscribed to the Google Groups "The Java Posse" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/javaposse?hl=en.
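As a closing sketch, here is roughly what the java.text.Collator recommendation from the post could look like in code. The class name and the sameForHumans helper are hypothetical, and the choice of PRIMARY strength is an assumption about what "equal for a human" should mean, not something the post specifies.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorCompare {
    /** Hypothetical helper: true if the locale's collation rules see only minor differences. */
    static boolean sameForHumans(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // PRIMARY strength looks at base letters only, so case differences are ignored.
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameForHumans("JOE", "joe", Locale.ENGLISH)); // true

        // What these print depends on the collation rules shipped with the JDK,
        // so no expected output is claimed here.
        System.out.println(sameForHumans("stra\u00DFe", "STRASSE", Locale.GERMAN));
        System.out.println(sameForHumans("m\u00FCller", "mueller", Locale.GERMAN));
    }
}
```

It is worth running the last two lines on your own JDK before relying on them: how ß and ü are treated is a property of the collation rules for the locale, and the ü-to-ue convention in particular may well not be covered.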
