> Thanks for looking into the Unicode support issues in Java RegEx.
> Since you have been working on Unicode in the past decade, I'm sure
> you understand that most of the issues you are pointing out here
> belongs to the "Extended Unicode Support: Level 2" as documented in
> UTS#18 Unicode Regular Expressions [2].

I don't know that "most" of my issues pertain to Level 2, although I
haven't actually counted up what falls into which category.

> Unfortunately the current Java RegEx implementation only
> supports the "Basic Unicode Support: Level 1",

Quite possibly you've done more work to make that statement true, but
as far as I can tell, the current regex class does not provide even the
most basic "Level 1" Unicode support specified in UTS#18. It does
support some of the Level 1 features, but not all of them. Several are
omitted, which I will draw attention to below.

> as specified in Java RegEx
> java.util.regex.Pattern API document [1].

> [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html

Is the source for that available? If it were, I'm sure I could easily
answer many of my own questions.

> [2] http://www.unicode.org/reports/tr18

Perhaps I'm misreading, but I do not believe that Java provides even
basic Level 1 support for regexes as specified in that document.

Sherman, you may have already added the necessary functionality for
Level 1 support, but I do not see it in the API you reference above.

It is quite easy to tell whether an implementation meets the Level 1
requirements, because under each of those 7 subsections there is a very
specific statement of what it takes to be considered to have met that
requirement. These statements are of the form "RX.Y: ..." where X is 1
for Level 1, 2 for Level 2, etc., and where Y is the subsection.

I quote from UTS#18:

    0.2 Conformance

    The following describes the possible ways that an implementation
    can claim conformance to this technical standard.

    All syntax and API presented in this document is only for the
    purpose of illustration; there is absolutely no requirement to
    follow such syntax or API. Regular expression syntax varies widely:
    the features discussed here would need to be adapted to the syntax
    of the particular implementation. In general, the syntax in
    examples is similar to that of Perl Regular Expressions, but it may
    not be exactly the same. While the API examples generally follow
    Java style, it is again only for illustration.

    C0. An implementation claiming conformance to this specification at
        any Level shall identify the version of this specification and
        the version of the Unicode Standard.

    C1. An implementation claiming conformance to Level 1 of this
        specification shall meet the requirements described in the
        following sections:

            RL1.1  Hex Notation
            RL1.2  Properties
            RL1.2a Compatibility Properties
            RL1.3  Subtraction and Intersection
            RL1.4  Simple Word Boundaries
            RL1.5  Simple Loose Matches
            RL1.6  Line Boundaries
            RL1.7  Supplementary Code Points

I'll now go through each of those individually.

--tom