The meaning of \n and \N in rules

Patrick R. Michaud Fri, 04 Nov 2005 09:56:56 -0800

Synopsis 5 says that "C<\n> now matches a logical (platform independent) 
newline not just C<\012>".  But the devil is in the details, and I'd
like confirmation (or discussion) of the details of \n so I can
implement it in PGE.


Quick summary:  I'm thinking that \n should be defined as 
the equivalent of

    rule nl { [ \015\012 | <[\015\012\f\x85\x{2028}\x{2029}]> ]: }

Note the colon (:) at the end of the pattern.  It prevents backtracking
into the preceding group, which means that the CRLF sequence (\x0d\x0a)
will always be treated as a single newline for purposes of matching
C<\n>.
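
For readers more used to other regex engines, the effect of the colon
can be sketched in Python.  This is only an illustration, not PGE code:
LOOSE_NL, ATOMIC_NL and is_n_newlines are hypothetical names, and the
negative lookahead after \r is an assumed stand-in for the cut (it
refuses to treat a CR as a newline of its own when an LF follows).

```python
import re

# Hypothetical Python stand-ins for the proposed rule; not PGE code.
# Without a cut: the engine may backtrack and split CRLF into CR + LF.
LOOSE_NL = r"(?:\r\n|[\r\n\x0c\x85\u2028\u2029])"
# With the cut emulated: a lone CR only counts when no LF follows it.
ATOMIC_NL = r"(?:\r\n|\r(?!\n)|[\n\x0c\x85\u2028\u2029])"


def is_n_newlines(nl, text, n):
    """True if text consists of exactly n newlines according to pattern nl."""
    return re.fullmatch(f"{nl}{{{n}}}", text) is not None


crlf = "\r\n"
loose_two = is_n_newlines(LOOSE_NL, crlf, 2)    # True: CRLF was split in two
atomic_two = is_n_newlines(ATOMIC_NL, crlf, 2)  # False: CRLF stays one unit
atomic_one = is_n_newlines(ATOMIC_NL, crlf, 1)  # True
```

The loose version is what the pattern would behave like if the colon
were left off.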

Discussion:  The common newline sequences in use today are
LF (\x0a), CRLF (\x0d\x0a), and CR (\x0d), depending on the
operating system involved.  CRLF is the tricky one when
it comes to quantification.  In particular, consider the
following:

    "\012\012\012\012" ~~ / \n**{4} /        # matches (4 LFs)
    "\015\015\015\015" ~~ / \n**{4} /        # matches (4 CRs)
    "\015\012\015\012" ~~ / \n**{4} /        # ??? 

I'm of the opinion that the sequence "\015\012" should always
be treated as a single newline, in which case the last
expression above would not match because the target string contains
only two newlines.  But I want to check if others' interpretations
square with mine on this point (and if there's no consensus on it, 
we may need to pose the question to p6l for an official ruling).
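
The three cases above can be checked against the proposed
interpretation with a small Python sketch.  It assumes the same
lookahead emulation of the cut; NL, count_newlines and
has_four_newlines are hypothetical helpers for illustration, not part
of PGE or Synopsis 5.

```python
import re

# Assumed emulation of the proposed \n: CRLF is one newline, and a CR
# followed by LF is never counted on its own.
NL = re.compile(r"\r\n|\r(?!\n)|[\n\x0c\x85\u2028\u2029]")


def count_newlines(text):
    """Number of logical newlines found in text."""
    return len(NL.findall(text))


def has_four_newlines(text):
    """Rough analogue of: text ~~ / \\n**{4} /"""
    return re.search(r"(?:%s){4}" % NL.pattern, text) is not None


four_lf = "\n" * 4
four_cr = "\r" * 4
two_crlf = "\r\n" * 2

lf_count = count_newlines(four_lf)      # 4
cr_count = count_newlines(four_cr)      # 4
crlf_count = count_newlines(two_crlf)   # 2
```

Under this reading has_four_newlines is true for the first two strings
and false for the third, which is the behavior argued for above.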

The other characters in the definition of C<\n> above come from
Unicode, which gives the following as line terminators:
    
    LF    - line feed           - U+000A
    CR    - carriage return     - U+000D
    CR+LF - CR followed by LF
    FF    - form feed           - U+000C
    NL    - next line           - U+0085
    LS    - line separator      - U+2028
    PS    - paragraph separator - U+2029

With this, the definition of \N is simply any character that
is not in the set [\012\015\x0c\x85\x{2028}\x{2029}].
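
As a rough illustration, the same set can be written as a negated
character class in Python.  LINE_TERMINATORS, NOT_NL and is_big_n are
hypothetical names chosen for this sketch.

```python
import re

# The six single-character line terminators listed above.
LINE_TERMINATORS = "\n\r\x0c\x85\u2028\u2029"

# Sketch of \N: any one character outside that set.
NOT_NL = re.compile("[^" + LINE_TERMINATORS + "]")


def is_big_n(ch):
    """True if the single character ch would match the sketched \\N."""
    return NOT_NL.fullmatch(ch) is not None


# Runs of \N characters are the line contents between terminators.
runs = re.findall("[^" + LINE_TERMINATORS + "]+", "ab\r\ncd\u2028ef")
```

Since both halves of CRLF are in the set, \N never matches inside a
CRLF pair.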

Comments and feedback welcomed.

Pm
