In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/54bdcd8ec4c7b2111381943f4fdd4a07d3fe1bf9?hp=8f226aeeda55a51eee04feb4b605d30997d9b592>

- Log -----------------------------------------------------------------
commit 54bdcd8ec4c7b2111381943f4fdd4a07d3fe1bf9
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 9 11:45:13 2015 -0600

    perlrebackslash: Add, correct \b{} text

    This fleshes out documentation about this new feature

M       pod/perlrebackslash.pod

commit febd1aee81db64f8e0eaa947896dada407bb7142
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 9 11:43:35 2015 -0600

    perlrebackslash: Nit

M       pod/perlrebackslash.pod

commit b3a7a0153e8fcae963c59b681613d503f5f4a266
Author: Karl Williamson <[email protected]>
Date:   Mon Mar 9 11:41:31 2015 -0600

    perluniprops: Add text about using with older Unicode releases

    This pod is generated by mktables.

M       lib/unicore/mktables
-----------------------------------------------------------------------

Summary of changes:
 lib/unicore/mktables    |  4 ++++
 pod/perlrebackslash.pod | 48 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index e796649..722bc06 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -16185,6 +16185,10 @@ controlling lists contained in the program
 C<\$Config{privlib}>/F<unicore/mktables> and then re-compiling and installing.
 (C<\%Config> is available from the Config module).
 
+Also, perl can be recompiled to operate on an earlier version of the Unicode
+standard. Further information is at
+C<\$Config{privlib}>/F<unicore/README.perl>.
+
 =head1 Other information in the Unicode data base
 
 The Unicode data base is delivered in two different formats. The XML version
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 425299d..b99d803 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -298,7 +298,7 @@ beginning with a "0".
 =head3 Hexadecimal escapes
 
 Like octal escapes, there are two forms of hexadecimal escapes, but both start
-with the same thing, C<\x>. This is followed by either exactly two hexadecimal
+with the sequence C<\x>. This is followed by either exactly two hexadecimal
 digits forming a number, or a hexadecimal number of arbitrary length surrounded
 by curly braces. The hexadecimal number is the code point of the character you
 want to express.
@@ -558,8 +558,10 @@ non-word characters nor for string ends. It may help to understand how
     \b really means (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
     \B really means (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
 
-In contrast, C<\b{...}> may or may not match at the beginning and end of
-the line depending on the boundary type (and C<\B{...}> never does).
+In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
+beginning and end of the line, depending on the boundary type. These
+implement the Unicode default boundaries, specified in
+L<http://www.unicode.org/reports/tr29/>.
 The boundary types currently available are:
 
 =over
@@ -579,25 +581,41 @@ natural language sentences. It gives good, but imperfect results. For
 example, it thinks that "Mr. Smith" is two sentences. More details are
 at L<http://www.unicode.org/reports/tr29/>. Note also that it thinks
 that anything matching L</\R> (except form feed and vertical tab) is a
-sentence boundary. This works with word-processor text which line wraps
+sentence boundary. C<\b{sb}> works with text designed for
+word-processors which wrap lines automatically for display, but
 hard-coded line boundaries are considered to be essentially the ends of
 text blocks (paragraphs really), and hence
-the ends of sententces. It doesn't well with text containing embedded
-newlines, like the source text of the document you are reading. Such
-text needs to be preprocessed to get rid of the line separators before
-looking for sentence boundaries. Some people view this as a bug in the
-Unicode standard.
+the ends of sententces. C<\b{sb}> doesn't do well with text containing
+embedded newlines, like the source text of the document you are reading.
+Such text needs to be preprocessed to get rid of the line separators
+before looking for sentence boundaries. Some people view this as a bug
+in the Unicode standard.
 
 =item C<\b{wb}>
 
 This matches a Unicode "Word Boundary". This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does. For example, it understands that apostrophes can
-be in the middle of words and that parentheses aren't. More details
-are at L<http://www.unicode.org/reports/tr29/>.
+be in the middle of words and that parentheses aren't (see the examples
+below). More details are at L<http://www.unicode.org/reports/tr29/>.
 
 =back
 
+It is important to realize that these are default boundary definitions,
+and that implementations may wish to tailor the results for particular
+purposes and locales. Also note that Perl gives you the definitions
+valid for the version of the Unicode Standard compiled into Perl. These
+rules are not considered stable and have been somewhat more subject to
+change than the rest of the Standard, and hence changing to a later Perl
+version may give you a different Unicode version whose changes may not
+be compatibile with what you coded for. If, necessary, you can
+recompile Perl with an earlier version of the Unicode standard. More
+information about that is in L<perluniprops/Unicode character properties
+that are NOT accepted by Perl>
+
+Unicode defines a fourth boundary type, accessible through the
+L<Unicode::LineBreak> module.
+
 Mnemonic: I<b>oundary.
 
 =back
@@ -621,10 +639,12 @@ Mnemonic: I<b>oundary.
      print $1; # Prints 'cat'
   }
 
-  print join "|", "He said, \"Do you care? (I don't).\""
-                  =~ m/ ( .+? \b{wb} ) /xg;
+  my $s = "He said, \"Is pi 3.14? (I'm not sure).\"";
+  print join("|", $s =~ m/ ( .+? \b ) /xg), "\n";
+  print join("|", $s =~ m/ ( .+? \b{wb} ) /xg), "\n";
  prints
-  He| |said|,| |"|Do| |you| |care|?| |(|I| |don't|)|.|"
+  He| |said|, "|Is| |pi| |3|.|14|? (|I|'|m| |not| |sure
+  He| |said|,| |"|Is| |pi| |3.14|?| |(|I'm| |not| |sure|)|.|"
 
 =head2 Misc

-- 
Perl5 Master Repository
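
For readers who want to try the documented behaviour, here is a minimal sketch that is not part of the commit. It assumes a helper named split_at (hypothetical, introduced only for this illustration) and reuses the test string from the patch. It checks that plain \b splits the string the same way as the lookaround expansion quoted in the pod, and, on Perl 5.22 or later, prints the \b{wb} split for comparison. Because the Unicode boundary rules may change between releases, the \b{wb} output is printed rather than assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the commit): return the pieces of $text
# that end at successive matches of the boundary fragment $boundary.
sub split_at {
    my ($boundary, $text) = @_;
    my $re = qr/ ( .+? $boundary ) /x;
    return $text =~ /$re/g;
}

my $s = "He said, \"Is pi 3.14? (I'm not sure).\"";

# Plain \b and the lookaround expansion quoted in the pod should agree.
my $expanded = '(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))';
my $plain    = join "|", split_at('\b', $s);
my $spelled  = join "|", split_at($expanded, $s);
print "$plain\n";
print $plain eq $spelled ? "same split\n" : "different split\n";

# \b{wb} needs Perl 5.22 or later; its result depends on the Unicode
# version compiled into perl, so no particular output is assumed here.
if ($] >= 5.022) {
    print join("|", split_at('\b{wb}', $s)), "\n";
}
```

The boundary is passed as a string fragment and compiled at run time, so the script still compiles on older perls that do not recognise \b{wb}.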
