Hi all,

Just to add a little background here :). I'm Martijn (I help run the London JUG FWIW) and I ran across an answer from Tom on StackOverflow about Unicode issues in Java - I quickly deleted my answer!
It was one of _those_ answers which really impressed all of us Java developers on that thread, especially those who knew a little about Unicode (I don't really count myself as one of them!).

So I asked Tom if he'd mind volunteering some of his time here, as I knew there was some Unicode 6.0 work going on. As he has a Perl and Unicode background, I thought he would be able to contribute to the discussions and work here (unlike someone like me, whose eyes glaze over if I have to do anything more complicated than setting a character encoding).

I met a few of the OpenJDK advocates at Devoxx and that's inspired me, so I'm happy to try and help out Tom on the Java side where I can (or more importantly try to get enthusiastic volunteers from my JUG to help out ;p).

Cheers,
Martijn
twitter - @karianna & @java7developer

On Sat, Dec 11, 2010 at 5:38 PM, Tom Christiansen <tchr...@perl.com> wrote:

> Good morning,
>
> I'm Tom Christiansen; some of you may know me from my work in the Perl
> Community. I'm here at the urging of Martijn Verburg, who thought that my
> recent discoveries should be heard by your group.
>
> I've been professionally programming for more than 25 years now, mostly in
> C and Perl. I recently joined the biomedical text-mining group at the
> University of Colorado, where the bulk of our code base is in Java.
>
> I've been responsible for working with large text corpora entirely in
> Unicode. For example, one corpus comprises almost 200,000 papers and 11
> gigabytes, while another is a single file of 6 gigabytes. I'm not new to
> Unicode, having worked with it a great deal over the last decade.
>
> Although most of our code base is in Java, we also have a considerable
> portion of Perl code and some Python code, too. This code often first
> tokenizes the input stream before moving on to more sophisticated semantic
> processing.
> I was quite surprised to learn how differently Java treated
> Unicode text than how the same text is treated by Perl and Python, even
> using identical regular expressions. This has proved to be a significant
> barrier to fully adopting Java for our Unicode work.
>
> This prompted me to make a comprehensive study of Unicode issues in Java,
> focusing on regular expressions but also exploring other areas. I've
> identified about two dozen individual areas that I feel deserve to be
> looked at. These range from mismatches between documentation and behavior,
> to unfortunate or inconvenient defaults (e.g. "documented not to work"), to
> genuine bugs and international standards violations.
>
> Taken as a whole, these problem areas make Java a very difficult choice for
> the sort of text processing my group needs to use it for. Surely many
> others all around the world are in a similar position.
>
> I've searched the archives for this mailing list, and have found no mention
> of these troubles either there, or indeed anywhere at all on the web. For
> example:
>
>   http://www.google.com/search?client=opera&rls=en&q=site:http://mail.openjdk.java.net/pipermail/i18n-dev+unicode&sourceid=opera&ie=utf-8&oe=utf-8
>
> I have working code that fixes what for us is the most egregious of these
> problems: that regexes were unusable on Unicode. One fundamental bug is
> that Java has misunderstood the connection between \b and \w regexes, so
> that now a string like "élève" is not matched by the pattern "\b\w+\b" at
> any point in the string.
>
> Other very serious problems include Java's unjustifiable demotion of legal
> Unicode whitespace characters from the set of whitespace characters
> (breaking tokenization), using Unicode property names in ways contrary to
> what the spec says they do, and in general supporting no Unicode properties
> any later than 3.0: even the critical Unicode 3.1 properties are ignored by
> Java. These are very serious problems.
> Java almost cannot be said to
> support Unicode--at least any Unicode release from the last ten
> years--until these critical deficiencies are fixed.
>
> You can find a brief synopsis of these specific troubles as well as a link
> to the Java code that fixes them here:
>
>   http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261
>
> I don't by any means think this is the best way to go about this. It's
> just a band-aid we needed quickly to allow us to move on with our work.
> I'd like to offer it as a starting point for discussion of the issues that
> prompted its creation.
>
> As I mentioned, I have a couple dozen different Java Unicode issues, and
> this addresses just one or two of them. When I get time, I'll try to bring
> up the others here in separate threads.
>
> If you could advise me how best to contribute to helping out here, I would
> be grateful.
>
> Thank you,
>
> --tom
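
For anyone who wants to try the \b / \w and whitespace issues from the quoted message on their own JDK, here is a minimal sketch. It is not the code behind the StackOverflow link. The class name UnicodeRegexDemo and the helper firstMatch are made up for illustration. It also assumes a JDK that has the Pattern.UNICODE_CHARACTER_CLASS flag, which only exists from Java 7 onward and so postdates this message. What the default-mode word match returns depends on the JDK release, so the sketch prints it rather than claiming a result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption for illustration only.
public class UnicodeRegexDemo {

    // Returns the first substring of input matched by regex, or null if there is no match.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "élève" written with escapes so the source file encoding does not matter.
        String word = "\u00e9l\u00e8ve";
        // U+2003 EM SPACE, a Unicode whitespace character outside ASCII.
        String emSpace = "\u2003";

        // Default mode: \w is ASCII-only, so the whole word is never matched.
        // The exact result (null or a fragment) varies by JDK release.
        System.out.println("default \\b\\w+\\b : "
                + firstMatch("\\b\\w+\\b", 0, word));

        // With the Java 7+ flag, \w and \b both follow Unicode definitions.
        System.out.println("unicode \\b\\w+\\b : "
                + firstMatch("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS, word));

        // Default \s covers only ASCII whitespace, so EM SPACE is not matched.
        System.out.println("default \\s matches EM SPACE: "
                + (firstMatch("\\s", 0, emSpace) != null));

        // With the flag, \s follows the Unicode White_Space property.
        System.out.println("unicode \\s matches EM SPACE: "
                + (firstMatch("\\s", Pattern.UNICODE_CHARACTER_CLASS, emSpace) != null));
    }
}
```

On JDKs older than 7 the flag is not available, which is the situation the quoted message describes.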