The concept of \b in a regular expression, meaning a match at the boundary between a word and a non-word, was invented by Larry Wall for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially, \b is defined to break between a run of word characters and a run of non-word characters.

The latest version of Perl 5 (recently released) has added \b{wb} (written \b{w} in TR18), based on Unicode's definition. Its programmers typically expect it to be a drop-in replacement for the old \b, with much better results in parsing natural languages.

But it isn't such a replacement, which has created some consternation. The main reason is that, unlike \b, it treats the boundary between two adjacent white space characters as a breaking opportunity, so it doesn't group them into runs. Thus, if you have two spaces after a full stop, it treats each space as an individual word.
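
To make the old behavior concrete, here is a minimal sketch. It uses Python's re module rather than Perl, on the assumption that its \b follows the same \w/\W rule; the sample string is made up for illustration. It shows only the classic \b, since Python's standard library has no equivalent of the Unicode word-boundary form.

```python
import re

# Hypothetical sample: a full stop followed by two spaces.
text = "End.  Next"

# The classic \b matches only where a \w character meets a \W character
# (or the edge of the string), never between two spaces.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

# Cut the text at those positions to see the runs that \b delimits.
cuts = sorted({0, *boundaries, len(text)})
runs = [text[a:b] for a, b in zip(cuts, cuts[1:])]

print(boundaries)  # [0, 3, 6, 10]
print(runs)        # ['End', '.  ', 'Next']
```

The full stop and both spaces come out as a single non-word run. With the Unicode-based boundary as described above, there would also be a break between the two spaces, so they would not stay together.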

My question is: "Was this intentional, and if so, why?"

TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W."

And UAX29, in describing intelligent cut and paste using the WB property, says "adjacent spaces are collapsed to a single space".

  • \b{wb} Karl Williamson
