Whoops, my bad. "straße" to uppercase in java *IS* the correct "STRASSE", even without using Locale.GERMANY. I screwed up my encoding (d'oh)!
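If you want to check this yourself without your editor's encoding getting in the way (which is what bit me), here's a minimal sketch. The class name EszettCheck and its helper methods are made up for illustration; the only real API in there is java.lang.String and java.util.Locale.

```java
import java.util.Locale;

// Hypothetical demo class; only the String and Locale calls are real API.
final class EszettCheck {
    // The Eszett is written as a unicode escape (U+00DF) so the
    // encoding of the source file cannot mangle it.
    static final String LOWER = "stra\u00DFe";
    static final String UPPER = "STRASSE";

    static String upper(String s) {
        // Locale.ROOT keeps the result independent of the machine's default locale.
        return s.toUpperCase(Locale.ROOT);
    }

    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    static boolean sameIgnoringCase(String a, String b) {
        return String.CASE_INSENSITIVE_ORDER.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(upper(LOWER).equals(UPPER));     // true: the Eszett expands to SS
        System.out.println(lower(UPPER));                   // strasse: no way back to the Eszett
        System.out.println(sameIgnoringCase(LOWER, UPPER)); // false
    }
}
```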
Of course, "STRASSE" to lowercase remains "strasse" regardless of locale, and "STRASSE" and "straße" aren't equal according to String.CASE_INSENSITIVE_ORDER, so, only a small win. Still. If you want to experiment, \u00DF is safer to play with than actually hitting alt+S (mac only).

For those who care, some extra commentary on reddit, where someone figured this one out:

http://www.reddit.com/r/programming/comments/cxb8h/case_insensitive_comparison_a_lot_more/

On Aug 4, 4:07 pm, Reinier Zwitserloot wrote:
> This is a long post that is mostly a fun walk through a few different
> languages used on planet earth, showing why the entire concept of
> 'upper' and 'lower' case is mostly broken if one subscribes to the
> theory that English isn't the only language around. The summary, in
> case you don't want to read it all, boils down to this: (A) Use
> String.CASE_INSENSITIVE_ORDER, it's better than toLowerCase()ing both
> arguments, and (B) actually, don't use that either, use
> java.text.Collator.
>
> But, if you have a casual interest in learning something, by all means
> read on!
>
> From one of the many scala threads, Dick uses .toLowerCase() as a
> placeholder for case insensitive string comparison, arguing that for
> example this:
>
> list.sort(_.toLowerCase);
>
> is much simpler than the Java alternative, which in Dick's example is
> e.g. an anonymous inner class literal to supply an implementation of
> Comparator.
>
> That's a flawed example. The right way to do this in Java would be:
>
> Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
>
> but it gets much worse, because this just isn't how one should ever
> compare strings. I'll give you three examples:
>
> EXAMPLE 1: The Eszett character.
>
> In German, a certain ligature for two consecutive s characters has
> been used so much for so long it's considered its own unique
> character. The Eszett or 'Scharfes s': ß.
> It shows up all over the place in German, for example in the word
> street, which is straße in German. The Eszett is somewhat unique in
> that it has no capital form. Words don't start with double s so there
> has never been a reason to capitalize it. Instead, if you ask a German
> to write street in all caps, they'd write "STRASSE".
>
> Uh-oh. That's not a reversible operation! The lowercase form of
> "STRASSE" is "strasse", and even knowing that the locale is German,
> you can't just guess that double-s *ALWAYS* becomes the ß character.
> You'd need a dictionary. Arguably you're better off doing that
> conversion than not doing it if you know it's German (more likely),
> but it won't be a sure thing. Nevertheless, trying to check if
> "straße" and "STRASSE" are two equal strings under case insensitive
> comparison SHOULD return true, and not false, but while that's in the
> cards, using .toLowerCase() as a proxy for case insensitive comparison
> is never going to get us that.
>
> Java screws this up, though. Java has chosen to adopt the Unicode
> invention of the capital Eszett, which nobody in Germany uses and is
> clearly a silly idea: You don't hack a living, breathing language to
> get around i18n problems. If that's feasible we might as well tell the
> world to stop whining and only use English when using machines. Even
> more dire, String.CASE_INSENSITIVE_ORDER does not in fact return
> '0' (indicating equality) when comparing straße and STRASSE, even
> though every single language that uses the ß (German. That's it!) says
> those two are case-insensitive equal. Nevertheless, the fact that Java
> has a bug here does not excuse the brokenness of the 'toLowerCase() is
> an alternative for case insensitive comparison' meme.
>
> EXAMPLE 2: The Turkish dotless/dotted i and the somewhat famous "PHP
> is completely broken on Turkish machines" problem.
>
> The ß vs.
> SS issue indicates that toLowerCase() isn't a valid replacement for
> case insensitive comparison, but it gets worse. You can't, in fact,
> say anything whatsoever about casing, or case insensitive equality,
> without knowing the locale you're operating in. In Turkish there's not
> one 'i' but two: The dotted i and the dotless i. Where this gets crazy
> is the capital forms of those. The capital form of 'i' is a _dotted_
> capital I, and the capital form of dotless 'i' is our normal, familiar
> capital I, which doesn't have a dot in it. This even messes with
> ligatures (fi is a common ligature where the tip of the curvy top of
> the f is joined with the i's dot. That's not appropriate in Turkish,
> where that dot is important). This means "i" and "I" are not equal in
> case-insensitive comparison in the Turkish locale. .toLowerCase()
> comparison gets this wrong, of course. toLowerCase(new Locale("tr"))
> would actually get this right. String.CASE_INSENSITIVE_ORDER gets this
> wrong, because you can't give it a locale. For natural language
> comparison (and why are you upper/lowering strings in the first place,
> if not because you're doing natural language comparison?), tLC(),
> tUC() and String.CASE_INSENSITIVE_ORDER are all inherently broken.
> Java gets this a little right and offers Locale-based variants of
> tLC() and tUC(), as I hinted at earlier.
>
> That PHP we always like to bash on? One of the things PHP does (or
> perhaps did, I don't keep up with it) wrong for the longest of times
> was that it would completely lose its marbles if the system's locale
> was set to Turkish. It would just fail to work. PHP identifiers are
> defined as case insensitive, and PHP implemented this by
> toLowerCase()ing everything, using the platform default locale. This
> turns "FILE" into "fıle" which is not equal to "file", and thus
> running most PHP code on a machine configured with the Turkish locale
> breaks, if the PHP is in English.
> This is a famous example of the universally lamented "it works for
> me" attitude. So, yes, making this mistake has dire consequences.
>
> EXAMPLE 3: ASCII hacks.
>
> To make matters worse still, let's think about why one would want to
> do a case insensitive comparison in the first place. Presumably
> because there's some user input of some sort that needs to be
> compared, and you don't want to bother the user with case sensitivity.
> However, if that's the aim, "case insensitive" is entirely the wrong
> idea. There are a bajillion systems around that can't deal with
> Unicode. For example, if many 'name' forms don't even accept a dash in
> your last name, how many do you think accept a "ü" in it? And yet
> plenty of German folks are called "Müller". The 'fix' these poor saps
> have used for years and years is to write the canonical ASCII
> alternative for their funky character. ß becomes ss, ü becomes ue,
> etcetera. I'm betting that if you intend for "JOE" and "joe" to be
> equal, then you should consider "müller" and "mueller" equal too.
>
> What's really needed is a human-inputted string comparator. Optimally
> speaking such a tool will first canonicalize each string, turning for
> example dotless i of any capitalization into a dotted lowercase i,
> turning ß into ss, ü into ue, etcetera, and only then compare the
> strings. I'm not even sure this can be done properly without knowing a
> locale, but you could do a lot better than
> String.CASE_INSENSITIVE_ORDER that way, and far better still versus
> comparing the .toLowerCase() versions of any two given strings.
>
> Java does in fact have something for this: java.text.Collator. This
> indeed does more or less what I just described, and it is in fact what
> one ought to be using. If I were you, I'd make a PMD plugin right now
> that checks for usage of locale-less tLC and tUC, as well as
> String.CASE_INSENSITIVE_ORDER, and flags 'em all as warnings.
>
> BONUS EXAMPLE: Lower case. Upper case. And Title case??
>
> Astute folks may have observed that Character.toTitleCase() exists.
> What's that, you ask? Well, in some languages, there are single
> characters that represent two 'sounds'. For example the 'dz'. That's
> one character in some languages. This is very similar to the Germans
> who have enshrined the much used ß ligature into a unique character.
> However, unlike the ß which has pretty much lost all resemblance to
> the original characters, and which can never appear at the beginning
> of a word, these characters are mostly just a kerned version of the
> original, and CAN appear at the beginning of the word. In Unicode
> speak these are called digraphs. These have 3 and not 2 capitalization
> forms: all lowercase, all uppercase, and the first part of the digraph
> uppercased, the second part lowercased. That third form is what you'd
> use when you want a word with just the first letter capitalized, and
> you use the all-caps version only if you need an all-caps rendering of
> the word (i.e. rarely). This obscure little factoid actually was
> useful for me when writing lombok: The method that turns "foo" into
> "getFoo" will use toTitleCase() and not .toUpperCase().
>
> And there ends our trip round the world. I hope you enjoyed it. Though
> now you know: That feeling of despair when someone mentions i18n?
> You're entirely correct in feeling it. It's a pain in the tusch!
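For anyone who wants to poke at the Turkish i and the title case business from the quoted post, here is a rough sketch. CasingDemo and its methods are names I made up, and capitalize() is only an illustration of the idea, not lombok's actual code. The String, Character and Locale calls are standard JDK API.

```java
import java.util.Locale;

// Hypothetical demo class; the String, Character and Locale calls are standard API.
final class CasingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Lowercases using Turkish rules: capital I maps to dotless i (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Uppercases using Turkish rules: i maps to dotted capital I (U+0130).
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    // Capitalizes only the first char, using title case rather than upper case.
    static String capitalize(String s) {
        if (s.isEmpty()) {
            return s;
        }
        return Character.toTitleCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(lowerTurkish("FILE").equals("file"));      // false
        System.out.println(lowerTurkish("FILE").equals("f\u0131le")); // true
        System.out.println("FILE".toLowerCase(Locale.ROOT));          // file
        // U+01F3 is the single-character dz digraph, U+01F2 its title case form (Dz).
        System.out.println(capitalize("\u01F3en").equals("\u01F2en")); // true
        System.out.println(capitalize("foo"));                         // Foo
    }
}
```

The point of passing an explicit Locale everywhere is that the no-argument toLowerCase() and toUpperCase() use the machine's default locale, which is exactly how the PHP-on-Turkish-machines problem happens.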

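And since the advice boils down to "use java.text.Collator", a minimal sketch of what that could look like. HumanCompare and roughlyEqual are hypothetical names, and picking PRIMARY strength is my assumption about what "human inputted" comparison should ignore (it drops both case and accent differences, which may be more than you want).

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper; Collator, its PRIMARY strength and Locale are standard API.
final class HumanCompare {
    // Returns true when the two strings differ at most by case or accents,
    // according to whatever collation rules the JDK ships for the locale.
    static boolean roughlyEqual(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore case and accent differences
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(roughlyEqual("JOE", "joe", Locale.ENGLISH)); // true
        // What these two print depends on the collation rules in your JDK,
        // so run them rather than trusting anyone's claim.
        System.out.println(roughlyEqual("stra\u00DFe", "STRASSE", Locale.GERMAN));
        System.out.println(roughlyEqual("m\u00FCller", "mueller", Locale.GERMAN));
    }
}
```

Collator implements Comparator<Object>, so a configured instance can also go straight into Collections.sort(list, collator) in place of String.CASE_INSENSITIVE_ORDER.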