\b{wb}

Karl Williamson Sat, 22 Aug 2015 13:13:47 -0700

The concept of \b in a regular expression meaning to match the boundary between a word and non-word was invented by Larry Wall, for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially \b is defined to break between runs of word characters versus runs of non-word characters.
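To make the classic behaviour concrete, here is a minimal sketch in Python, whose standard re module uses the same \w-based definition of \b. The function name and the sample string are made up for illustration, and the sketch assumes Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re

def split_on_classic_b(text):
    # Split at every classic \b position (a \w / \W transition) and
    # drop the empty pieces produced at the start and end of the string.
    return [piece for piece in re.split(r"\b", text) if piece]

print(split_on_classic_b("Hello.  World"))
# ['Hello', '.  ', 'World']
```

The full stop and the two spaces stay together as a single run of non-word characters, because \b only fires where a word character meets a non-word character.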

The latest version of Perl 5 (recently released) has added \b{w} based on Unicode's definition. The typical expectation of its programmers is that it would be a drop-in replacement for the old \b, with much better results in parsing natural languages.

But it isn't such a replacement, creating some consternation, and the main reason is that, unlike \b, it treats the boundary between whitespace characters as a breaking opportunity, so that it doesn't create runs of them. Thus if you have two spaces after a full stop, it treats each as an individual word.
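The effect can be imitated with a toy model in Python. This is only a rough illustration of the behaviour described above, not the UAX #29 word-break algorithm (it ignores apostrophes, numeric separators and every other real rule), and the function name is hypothetical.

```python
import re

def toy_wb_segments(text):
    # Toy model only, NOT the UAX #29 algorithm: keep runs of word
    # characters together, but emit every other character on its own,
    # so adjacent spaces are never merged into one segment.
    return re.findall(r"\w+|\W", text)

print(toy_wb_segments("Hello.  World"))
# ['Hello', '.', ' ', ' ', 'World']
```

Each of the two spaces after the full stop comes out as its own segment, whereas splitting on the classic \b would keep the full stop and both spaces together as one run.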


My question is "Was this intentional, and if so, why?"

TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W."

And UAX29 says "adjacent spaces are collapsed to a single space" in intelligent cut and paste using the WB property.
