Re: question regarding rules and bytes vs characters

Larry Wall Fri, 09 Jul 2004 15:55:02 -0700

On Tue, Jun 01, 2004 at 07:56:41AM +0200, Ph. Marek wrote:
: Hello everybody,
: 
: I'm about to learn myself perl6 (after using perl5 for some time).


I'm also trying to learn perl6 after using perl5 for some time.  :-)

: One of my first questions deals with regexes.
: 
: 
: I'd like to parse data of the form
:       Len: 15\n
:       (15 bytes data)\n
:       Len: 5\n
:       (5 bytes data)\n
:       \n
:       OtherTag: some value here\n
: and so on, where the data can (and will) be binary.
: 
: I'd try for something like
:       my $data_tag= rule { 
:               Len\: $len:=(\d) \n 
:               $data:=([:u0 .]<$len>)\n  # these are bytes
:       };
: 
: Is that correct?

Pretty close.  The way it's set up currently, $len is a reference
to a variable external to the rule, so $len is likely to fail under
stricture unless you've declared "my $len" somewhere.  To make the
variable automatically scope to the rule, you have to use $?len
these days.
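
For readers who want something runnable today, here is a rough Python sketch (not Perl 6) of the length-prefixed parse discussed above. It assumes the framing from the quoted example: a "Len: N" header line, N bytes of payload, then a newline. The names LEN_HEADER and parse_records are made up for illustration, and the header pattern accepts more than one digit so that a length like 15 works.

```python
import re

# Assumed header grammar, taken from the sample input: "Len: <digits>\n".
# The pattern is a bytes pattern, so everything here counts bytes, not characters.
LEN_HEADER = re.compile(rb"Len: (\d+)\n")

def parse_records(buf: bytes):
    """Return (payloads, offset) where offset is the first byte not consumed."""
    records = []
    pos = 0
    while True:
        m = LEN_HEADER.match(buf, pos)
        if m is None:
            break  # no further "Len:" header at this position
        length = int(m.group(1))
        start = m.end()
        end = start + length
        # The payload must be followed by a newline, as in the sample input.
        if buf[end:end + 1] != b"\n":
            raise ValueError(f"record at offset {pos} is truncated or malformed")
        records.append(buf[start:end])  # payload may contain any byte, even "\n"
        pos = end + 1
    return records, pos

sample = b"Len: 5\na\nb\xff\x00\nLen: 3\nxyz\n\nOtherTag: v\n"
records, rest_offset = parse_records(sample)
```

Because the payload is taken by length rather than by pattern, embedded newlines and non-text bytes in the data do not confuse the parse.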

: And furthermore is perl6 said to be unicode-ready.
: So I put the :u0-modifier in the data-regex; will that DWIM if I try to match 
: a unicode-string with that rule?

It should.  However (and this is a really big however), you'll have
to be very careful that something earlier hasn't converted one form
of Unicode to another on you.  For instance, if your string came in
as UTF-8, and your I/O layer translated it internally to UTF-32 or
some such, you're just completely hosed.  When you're working at the
bytes level, you must know the encoding of your string.
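
To make the hazard concrete, here is a small Python illustration (again, not Perl 6) of how the same text occupies a different number of bytes depending on the encoding. The sample string is arbitrary; the point is only that a byte count from a "Len:" header is meaningless unless you know which encoding the bytes are in.

```python
# Five characters; the third is U+00EF (i with diaeresis).
text = "na\u00efve"

utf8 = text.encode("utf-8")       # U+00EF takes two bytes in UTF-8
utf32 = text.encode("utf-32-be")  # every code point takes four bytes, no BOM

char_count = len(text)
utf8_bytes = len(utf8)
utf32_bytes = len(utf32)

print(char_count, utf8_bytes, utf32_bytes)
```

A length that was correct for the UTF-8 form would cut the UTF-32 form off partway through the second character.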

So the natural reaction is to open your I/O handle :raw to get binary
data into your string.  Then you try to match Unicode graphemes with
[ :u2 . ] and discover that *that* doesn't work.  Which is obvious when
you consider that Perl has no way of knowing which Unicode encoding
the binary data is in, so it's gonna consider it to be something like
Latin-1 unless you tell it otherwise.  So you'll probably have to
cast the binary string to whatever its actual encoding is (potentially
lying about the binary parts, which we may or may not get away with,
depending on who validates the string when), or maybe we just need
to define rules like <utf16be_codepoint> and <utf8_grapheme> for use
under the :u0 regime.
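
The same trade-off can be shown in Python, purely as an analogy: raw bytes read as Latin-1 always "decode" but mangle any UTF-8 text, while casting the whole buffer to UTF-8 fails as soon as the binary part is not valid UTF-8. One workaround, sketched here with a made-up buffer, is to keep the data as bytes and decode only the span known to be text.

```python
# Hypothetical buffer: an ASCII header, two arbitrary binary bytes, then UTF-8 text.
raw = b"Len: 2\n" + b"\xff\xfe" + "\ncaf\u00e9\n".encode("utf-8")

# Latin-1 maps every byte to one character, so this never fails,
# but the two UTF-8 bytes of U+00E9 come out as two separate characters.
as_latin1 = raw.decode("latin-1")

# Claiming the whole buffer is UTF-8 fails on the binary payload.
try:
    raw.decode("utf-8")
    whole_buffer_is_utf8 = True
except UnicodeDecodeError:
    whole_buffer_is_utf8 = False

# Decode only the part known to be text: 7 header bytes, 2 payload bytes,
# and 1 newline come first, so the text starts at offset 10.
text_part = raw[10:].decode("utf-8")
```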

: Is anything known about the internals of pattern matching whether the 
: hypothetical variables will consume (double) space?
: I'm asking because I imagine getting a tag like "Len: 200000000" and then 
: having problems with 256MB RAM. Matching shouldn't be a problem according to 
: apo 5 (see the chapter "RFC 093: Regex: Support for incremental pattern 
: matching") but I'll maybe have troubles using the matched data?

My understanding is that Parrot implements copy-on-write, so you should
be okay there.

: Thank you for all answers!

Even the late ones?  :-)

Larry
