Re: The meaning of \n and \N in rules

Larry Wall Fri, 04 Nov 2005 11:32:38 -0800

On Fri, Nov 04, 2005 at 09:53:07AM -0600, Patrick R. Michaud wrote:
: Synopsis 5 says that "C<\n> now matches a logical (platform independent) 
: newline not just C<\012>".  But the devil is in the details, and I'm
: wanting confirmation (or discussion) of the details on \n so I can
: implement it in PGE...
: 
: Quick summary:  I'm thinking that \n should be defined as 
: the equivalent of
: 
:     rule nl { [ \015\012 | <[\015\012\f\x85\x{2028}\x{2029}]> ]: }
: 
: Note the colon (:) at the end of the pattern, which means that the 
: CRLF sequence (\x0d\x0a) will always be treated as a single newline 
: for purposes of matching C<\n>.


That seems like a reasonable first approximation to me.  The main thing
will be to make sure we keep things consistent between rules and
filehandles that do autochomping.  One approach would be to make
autochomping always use rules to recognize newlines, but that might
well be something that a filehandle would want to optimize.
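
One way to picture keeping rules and autochomping consistent is to
have both go through a single newline pattern.  What follows is a
minimal Python sketch, not PGE or Perl 6 code.  The names NL and chomp
are hypothetical, and the (?!\n) lookahead is an assumed stand-in for
the colon at the end of the proposed rule.

```python
import re

# Hypothetical stand-in for the proposed <nl> rule. The (?!\n) lookahead
# plays the role of the trailing colon: a CR that is followed by LF is
# never allowed to match on its own.
NL = r"(?:\r\n|\r(?!\n)|[\n\f\x85\u2028\u2029])"


def chomp(line: str) -> str:
    """Strip at most one trailing logical newline, using the same NL pattern."""
    return re.sub(NL + r"\Z", "", line, count=1)


if __name__ == "__main__":
    for sample in ["abc\r\n", "abc\n\n", "abc\r", "abc"]:
        print(repr(sample), "->", repr(chomp(sample)))
```

A filehandle that optimizes this would presumably hand-code the same
cases instead of calling the regex engine, but the observable result
would have to stay the same.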

: Discussion:  The common newline characters in use today are
: LF (\x0a), CRLF (\x0d\x0a), and CR (\x0d) depending on the
: operating system involved.  CRLF is the tricky one when
: it comes to quantification.  In particular, consider the
: following:
: 
:     "\012\012\012\012" ~~ / \n**{4} /        # matches (4 LFs)
:     "\015\015\015\015" ~~ / \n**{4} /        # matches (4 CRs)
:     "\015\012\015\012" ~~ / \n**{4} /        # ??? 

\n**{2}, presumably.

: I'm of the opinion that the sequence "\015\012" should always
: be treated as a single newline, in which case the last
: expression above would not match because the target string contains
: only two newlines.  But I want to check if others' interpretations
: square with mine on this point (and if there's no consensus on it, 
: we may need to pose the question to p6l for an official ruling).

Seems fine to me, unless it makes lots of programs run twice as slow,
which I tend to doubt.
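
The three cases quoted above can be checked with a rough Python
sketch.  It assumes a lookahead in place of the colon, and NL,
NAIVE_NL, has_run, and count_newlines are hypothetical names chosen
for illustration.  NAIVE_NL is the same alternation without that
guard, included only to show what backtracking would otherwise allow.

```python
import re

# CRLF is tried first, and a CR followed by LF may not match by itself.
NL = r"(?:\r\n|\r(?!\n)|[\n\f\x85\u2028\u2029])"

# Same alternatives, but backtracking may split CRLF into CR and LF.
NAIVE_NL = r"(?:\r\n|[\r\n\f\x85\u2028\u2029])"


def has_run(text: str, n: int, pattern: str = NL) -> bool:
    """True if text contains n consecutive newlines under the given pattern."""
    return re.search(f"{pattern}{{{n}}}", text) is not None


def count_newlines(text: str) -> int:
    """Number of logical newlines in text, counting CRLF once."""
    return len(re.findall(NL, text))


if __name__ == "__main__":
    print(has_run("\n\n\n\n", 4))            # four LFs
    print(has_run("\r\r\r\r", 4))            # four CRs
    print(has_run("\r\n\r\n", 4))            # two CRLF pairs
    print(has_run("\r\n\r\n", 4, NAIVE_NL))  # same string, unguarded pattern
    print(count_newlines("\r\n\r\n"))
```

Under these assumptions the CRLF string holds two newlines, so a
request for four fails unless the pattern is allowed to split the pair.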

: The other characters in the definition of C<\n> above come from
: Unicode, which gives the following as line terminators:

Well, one can distinguish separators from terminators, but yes.

:     LF - line feed - U+000A
:     CR - carriage return - U+000D
:     CR+LF - CR followed by LF
:     FF - form feed - U+000C
:     NL - next line - U+0085
:     LS - line separator - U+2028
:     PS - paragraph separator - U+2029
: 
: With this, the definition of \N is simply any character that
: is not in the set [\012\015\x0c\x85\x{2028}\x{2029}].

Er, yes.  Whatever a "character" is...

More specifically, \N means [<!before \n>.] where "." can mean
whatever the current lexical Unicode view makes it mean, anything
from byte to "language-dependent character" (though in any case capped
by the maximum abstraction level allowed by the string type itself).
To put it another way, a "." assumes the maximum allowed abstraction
level, where that level can be capped by either the lexical scope or
the string type.  The default for Perl 6's lexical cap is "grapheme",
but you can warp it up or down by explicit declaration.  The default
for strings depends on the type of the string.  A byte string allows
only byte meanings of ".", for instance, even in a lax lexical scope.
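
The [<!before \n>.] reading can be approximated in Python as a negative
lookahead followed by ".".  This is only a sketch under assumptions:
Python's re module offers a code point view for str and a byte view
for bytes, with no grapheme level, so it shows two of the abstraction
levels and not the Perl 6 default.  All names are hypothetical, and
the byte pattern deliberately lists only the single-byte ASCII
newlines.

```python
import re

NL = r"(?:\r\n|\r(?!\n)|[\n\f\x85\u2028\u2029])"

# \N at the code point level: any "." not positioned at a newline.
NOT_NL = rf"(?:(?!{NL}).)"

# \N at the byte level, with an assumed ASCII-only newline set.
BYTE_NOT_NL = rb"(?:(?!\r\n|\r(?!\n)|[\n\f]).)"


def non_newline_prefix(text: str) -> str:
    """Longest leading run of \\N characters (code point view)."""
    return re.match(f"{NOT_NL}*", text, flags=re.DOTALL).group(0)


def dot_count(text: str) -> int:
    """How many times \\N matches when '.' means one code point."""
    return len(re.findall(NOT_NL, text, flags=re.DOTALL))


def byte_dot_count(text: str) -> int:
    """How many times \\N matches when '.' means one UTF-8 byte."""
    return len(re.findall(BYTE_NOT_NL, text.encode("utf-8"), flags=re.DOTALL))


if __name__ == "__main__":
    accented = "e\u0301"  # 'e' plus a combining acute accent
    print(repr(non_newline_prefix("abc\r\ndef")))
    print(dot_count(accented), byte_dot_count(accented))
```

The same two-code-point string would count as a single "." at the
grapheme level, which is the part this sketch cannot show.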

Larry
