regex rewriting code (part 3 of 3)

Tom Christiansen Tue, 25 Jan 2011 11:32:09 -0800

Now I will discuss the more interesting of my two functions, the one that
handles charclass escapes such as those given in RL1.2a.  The particular
Level 1 places where this code is relevant are RL1.2a's Annex C Compatibility
Properties, RL1.4 Simple Word Boundaries, and RL1.6 Line Boundaries.


I do not try to change the overloaded property names in \p{...} that Java
is using for something other than what Unicode says they must mean.  I only
change the single-letter backslash escapes, but for those I do implement all
the Standard Recommendations.  It is not too hard.

I could not satisfy the list of 11 required Unicode properties from RL1.2,
of which Java currently provides only 3.  I did manage to cover all
backslash charclass escapes listed in RL1.2a.  I believe that apart from
the 8 (now 7) missing properties from RL1.2, between the previously
discussed function from part 2 and the one I describe below I address all
other concerns I have raised about Java's Level 1 conformance.

Again, I present this not as code I think you should necessarily use.  I
merely cite it as an existence proof that these corrections are perfectly
feasible, and for the most part, fairly easily done.  Sometimes just knowing
that something *can* be done makes all the difference.

I do not discuss matters of backwards compatibility.  There are many possible
approaches on that, and I will address them in a future letter.

CHARCLASS ESCAPES
=================

The function translates the following regex-only escapes into equivalent
constructs that allow Java to meet the RL1.2a requirements as I understand
those to be.  The first row changes how existing Java patterns (mis)behave,
while the second row is new to Java:

     \s \S       \w \W       \b \B       \d \D

     \v \V       \h \H       \R          \X

Java's biggest problem is \w vs \b; Java's second biggest problem in this
area is \s. (Capitalize those as needed to produce the complements, about
which the same applies.)

I'll take these charclass escapes one set at a time.

For \d, I use the Standard Recommendation of \p{Decimal_Number}.
This of all the charclass escapes seems to have a sufficiently
useful alternate definition that one might wish to make some
allowance.  Indeed, Annex C already does so: under the POSIX
Compatibility column, it shows that one is allowed to use [0-9]
instead of this.
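
To make the two Annex C choices for \d concrete, here is a minimal sketch
comparing them.  The class and method names are illustrative only and are
not part of the rewriting function described in this letter.

```java
import java.util.regex.Pattern;

final class DigitEscape {
    // Standard Recommendation for \d: any Decimal_Number (gc=Nd).
    static final String UNICODE_DIGITS = "\\p{Nd}+";
    // POSIX Compatible alternative that Annex C also permits.
    static final String ASCII_DIGITS = "[0-9]+";

    // True when the whole text consists of digits under the given definition.
    static boolean allDigits(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // "42" followed by ARABIC-INDIC DIGIT THREE (U+0663).
        String mixed = "42\u0663";
        System.out.println("Nd:    " + allDigits(UNICODE_DIGITS, mixed));
        System.out.println("[0-9]: " + allDigits(ASCII_DIGITS, mixed));
    }
}
```

The only difference between the two is whether non-ASCII decimal digits
count, which is why this is the one escape where an alternate definition
is plausibly useful.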

For \s, I use the strict Unicode White_Space definition as defined by
RL1.2.  Although required for Level 1 compliance, this is not a property
that Java currently has direct access to.  Indeed, Java currently has no
mechanism for detecting Unicode whitespace.  The following table of J(ava)
vs P(erl) results illustrates the problem:

                           Code point
             Regex    000A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

I therefore enumerate the small set directly:

    \s      [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

The \v and \h are the Unicode Vertical_Space and Horizontal_Space
properties, for programmer convenience and also compatibility
with Perl.  Those are

    \v      [\u000A-\u000D\u0085\u2028\u2029]

    \h      [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

The complements, \S, \V, and \H, follow directly from the
definitions given above.
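
The enumerated classes can be checked against stock Java with a small
sketch like the one below.  The class and method names are hypothetical.
It assumes Java's default meaning of \s (no UNICODE_CHARACTER_CLASS flag),
and it builds the whitespace class as the union of the horizontal and
vertical sets, which is what White_Space amounts to.

```java
import java.util.regex.Pattern;

final class SpaceEscapes {
    // Regex source for the enumerated classes.  The doubled backslash keeps
    // each escape at the regex level instead of the Java source level.
    static final String H =
        "[\\u0009\\u0020\\u00A0\\u1680\\u180E\\u2000-\\u200A\\u202F\\u205F\\u3000]";
    static final String V = "[\\u000A-\\u000D\\u0085\\u2028\\u2029]";
    // White_Space as the union of the two sets (this includes TAB, U+0009).
    static final String S = "[" + H + V + "]";

    // True when the single code point matches the given class.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int[] samples = {0x000A, 0x0085, 0x00A0, 0x2029};
        for (int cp : samples) {
            System.out.printf("U+%04X  stock \\s=%b  rewritten=%b%n",
                cp, matches("\\s", cp), matches(S, cp));
        }
    }
}
```

Writing the complements as negative lookaheads over these classes, rather
than as negated nested brackets, sidesteps any question of how a given
Java release treats ^ in front of a nested class.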

For \w, I have taken the nearest possible approximation to
Annex C's definition, which is all of

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

I say nearest possible approximation, but what I have devised is input-
output equivalent to that definition.  That is, it produces zero false
negatives and zero false positives when checked against every single
possible valid Unicode code point.  This should suffice.

Now, normally, I would not be able to do this because I have no access to
the full Unicode Alphabetic property, which includes various but not
all Mark characters, and some circled letters.  Fortunately, the addition
of all Mark characters to \w removes the problem of distinguishing the
Alphabetic Marks from the non-Alphabetic ones.  I have even devised a
way to handle the circled letters.

My resulting translation of \w is

    \w  [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

and so my translation of \W is the complement of that bracketed
character class:

    \W  [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
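
As a minimal sketch of what the \w translation buys, the following checks
a few code points against the bracketed class.  The class and method names
are illustrative, not taken from the rewriting function, and the stock \w
comparison assumes Java's default (ASCII-only) behaviour.

```java
import java.util.regex.Pattern;

final class WordClass {
    // The translated \w, with both the nested and the outer class closed.
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // True when the single code point is a word character under W.
    static boolean isWordChar(int codePoint) {
        return Pattern.matches(W, new String(Character.toChars(codePoint)));
    }

    // True when the whole string consists of one or more W characters.
    static boolean isWord(String text) {
        return Pattern.matches(W + "+", text);
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve";
        System.out.println("rewritten: " + isWord(eleve));
        System.out.println("stock \\w+: " + Pattern.matches("\\w+", eleve));
        // CIRCLED LATIN CAPITAL LETTER A (U+24B6) is gc=So but Alphabetic.
        System.out.println("U+24B6:    " + isWordChar(0x24B6));
        // CIRCLED DIGIT ONE (U+2460) is gc=No and should stay excluded.
        System.out.println("U+2460:    " + isWordChar(0x2460));
    }
}
```

The intersection with \p{So} is what admits the circled letters while
keeping the circled digits, which share the same block, out of the class.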

It is far more reasonable to require ASCII-only people to write

    [\p{ASCII}&&\w]

if that is what they want than it is to make Unicode people write the long
version I give above when they want it!  But I have several different ideas
for how to appease the Backwards Compatibility Police.  More on that later.

The \R is to meet one of the two strong recommendations for Level 1
compliance, found under RL1.6 Line Boundaries.
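
The expansion used for \R is not shown in this letter.  As an assumption
only, one plausible form treats CR LF as a single unit and otherwise
accepts any one character from the \v set given earlier.  A minimal sketch,
with hypothetical names:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class Linebreak {
    // Assumed expansion of \R: CR LF first, else one vertical-space character.
    // The atomic group keeps a matched CR LF pair from being split again.
    static final String R =
        "(?>\\r\\n|[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    // Splits text on every line boundary recognised by R.
    static String[] lines(String text) {
        return Pattern.compile(R).split(text);
    }

    public static void main(String[] args) {
        // CR LF, LINE SEPARATOR (U+2028), and NEXT LINE (U+0085) as breaks.
        String text = "a\r\nb\u2028c\u0085d";
        System.out.println(Arrays.toString(lines(text)));
    }
}
```

Putting the two-character alternative first matters: with the order
reversed, a CR LF pair would be counted as two line boundaries.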

The \X is to meet RL1.2a, and also to go part of the way toward meeting
RL2.2 Extended Grapheme Clusters.  It would be easy to match Legacy
Grapheme Clusters exactly.  For that you would merely need to replace \X
with (?:\PM\pM*+), which is easily enough done.
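
The legacy form is simple enough to try directly.  This sketch, whose
class and method names are illustrative only, splits a string into
base-plus-marks clusters using exactly that replacement:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacyClusters {
    // One non-Mark code point followed possessively by any number of Marks.
    static final Pattern CLUSTER = Pattern.compile("(?:\\PM\\pM*+)");

    // Returns the legacy grapheme clusters found in text, in order.
    static List<String> clusters(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CLUSTER.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Decomposed spelling: e + COMBINING ACUTE, l, e + COMBINING GRAVE, v, e.
        String decomposed = "e\u0301le\u0300ve";
        System.out.println(clusters(decomposed).size() + " clusters in "
            + decomposed.length() + " chars");
    }
}
```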

However, I wanted Extended Grapheme Clusters, which is *much* harder.
In fact, it is this hard:

    EGC =   ( CR LF )
          | ( Prepend*
              ( L+ | (L* ( ( V | LV ) V* | LVT ) T*) | T+ | [^ Control CR LF ] )
              ( Extend | SpacingMark )*
            )
          | .

Where the L, V, LV, LVT, and T bits are defined in terms of the Hangul
Syllable Type property.  I came very close, but absent full access to the
HST property values, it is not going to always work correctly on Korean.
It will work correctly on Western languages, however.  I will not show the
expansion here, as it has that Medusa-look property. :)  But it does work
except for border-cases in East Asian languages.

The way \b works is related to RL1.4 Simple Word Boundaries.  How this ties
into RL1.2a is a bit confusing, since the language on this in RL1.2a's
Annex C is less than perfectly clear.  However, the ambiguities disappear
when we consider supporting text from elsewhere in tr18.

Under the second, POSIX Compatible column of Annex C, it gives "n/a" for \b.
Under the first, Standard Recommendation column, for \b it gives "Default Word
Boundaries".  Then in the third, Comments column it says:

    If there is a requirement that \b align with \w, then it would use the
    approximation above instead. See [UAX29], also WordBreakTest.  Note
    that different functions are used for programming language identifier
    boundaries. See also [UAX31].

So we seem to be left wondering what the relationship is supposed to be
between \b and \w.  However, further down in 2.2 Extended Grapheme Clusters
we find this text:

    \b{w}   Zero-width match at a Unicode word boundary. Note that this
            is different than \b alone, which corresponds to \w and \W.
            See Annex C: Compatibility Properties.

That clearly asserts that "\b alone corresponds to \w and \W."  That is its
historical sense, and that is what I have implemented in the function that
replaces charclass escapes with something Java can handle.

Assuming that one shall use my replacement definition of \w as given above,
that makes \b and \B work correctly if you write them this way:

   Orig    Rewrite

    \b     (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))

    \B     (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

I have a gigantic test suite in which I exhaustively tested all possible
combinations -- literally millions of them -- to verify that these were
equivalent statements.  Since I have proven that they are, I then replaced
the \w in them with the longer bracket character class given above.

    \b      
(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}ww\p{So}])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]))
    \B     
(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]))
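
Nobody has to type those by hand if the expansion is assembled from the
\w class.  The following is a minimal sketch of that assembly, not the
rewriting function itself; the class, field, and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundary {
    // The translated \w class.
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    // \b: a word character on exactly one side of the current position.
    static final String B =
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))";
    // \B: word characters on both sides, or on neither.
    static final String NOT_B =
        "(?:(?<=" + W + ")(?=" + W + ")|(?<!" + W + ")(?!" + W + "))";
    // Equivalent of \b\w+\b under the rewritten definitions.
    static final Pattern WORD = Pattern.compile(B + W + "+" + B);

    // Returns every whole word found in text, in order.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("une \u00e9l\u00e8ve studieuse"));
    }
}
```

Because the lookbehinds and lookaheads each inspect a single code point,
they stay within what Java accepts as a bounded lookbehind.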

It may have been merely unreasonable to make people write the
replacements for \w, but it is simply beyond the pale to ask
that they on their own should write the replacements for \b.

And now at last they can in Java expect \b\w+\b to match
the string "élève".  And in its entirety, no less.

This concludes my letter discussing the code that brought
us to each other's attention in the first place.

--tom
