On Tue, Jun 01, 2004 at 07:56:41AM +0200, Ph. Marek wrote:
: Hello everybody,
:
: I'm about to learn myself perl6 (after using perl5 for some time).

I'm also trying to learn perl6 after using perl5 for some time.  :-)

: One of my first questions deals with regexes.
:
: I'd like to parse data of the form
:     Len: 15\n
:     (15 bytes data)\n
:     Len: 5\n
:     (5 bytes data)\n
:     \n
:     OtherTag: some value here\n
: and so on, where the data can (and will) be binary.
:
: I'd try for something like
:     my $data_tag= rule {
:         Len\: $len:=(\d) \n
:         $data:=([:u0 .]<$len>)\n    # these are bytes
:     };
:
: Is that correct?

Pretty close.  The way it's set up currently, $len is a reference to
a variable external to the rule, so $len is likely to fail under
stricture unless you've declared "my $len" somewhere.  To make the
variable automatically scope to the rule, you have to use $?len
these days.

: And furthermore is perl6 said to be unicode-ready.
: So I put the :u0-modifier in the data-regex; will that DWIM if I try to match
: a unicode-string with that rule?

It should.  However (and this is a really big however), you'll have
to be very careful that something earlier hasn't converted one form
of Unicode to another on you.  For instance, if your string came in
as UTF-8, and your I/O layer translated it internally to UTF-32 or
some such, you're just completely hosed.  When you're working at the
bytes level, you must know the encoding of your string.

So the natural reaction is to open your I/O handle :raw to get binary
data into your string.  Then you try to match Unicode graphemes with
[ :u2 . ] and discover that *that* doesn't work.  Which is obvious
when you consider that Perl has no way of knowing which Unicode
encoding the binary data is in, so it's gonna consider it to be
something like Latin-1 unless you tell it otherwise.
So you'll probably have to cast the binary string to whatever its
actual encoding is (potentially lying about the binary parts, which
we may or may not get away with, depending on who validates the
string when), or maybe we just need to define rules like
<utf16be_codepoint> and <utf8_grapheme> for use under the :u0 regime.

: Is anything known about the internals of pattern matching whether the
: hypothetical variables will consume (double) space?
: I'm asking because I imagine getting a tag like "Len: 200000000" and then
: having problems with 256MB RAM. Matching shouldn't be a problem according to
: apo 5 (see the chapter "RFC 093: Regex: Support for incremental pattern
: matching") but I'll maybe have troubles using the matched data?

My understanding is that Parrot implements copy-on-write, so you
should be okay there.

: Thank you for all answers!

Even the late ones?  :-)

Larry
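
The point above about knowing your encoding when working at the bytes
level can be illustrated with a minimal sketch.  It is written in
Python rather than Perl 6, because the rule syntax discussed in this
thread was still changing and is not runnable as shown.  The names
HEADER and parse_records are hypothetical, invented for this
illustration, and the sketch assumes that the Len: value counts bytes
and that each data block is followed by a single newline, as in the
example data.

```python
import re
from typing import List

# Hypothetical helper for illustration only (not from the thread).
# The header is matched on raw bytes, so Len: always counts bytes.
HEADER = re.compile(rb"Len: (\d+)\n")

def parse_records(buf: bytes) -> List[bytes]:
    """Return the data blocks of consecutive 'Len: N' records in buf."""
    records = []
    pos = 0
    while True:
        m = HEADER.match(buf, pos)
        if m is None:
            break  # no further Len: header (blank line or another tag)
        start = m.end()
        end = start + int(m.group(1))
        # Assumption: every data block is terminated by one newline.
        if buf[end:end + 1] != b"\n":
            raise ValueError(f"record at offset {pos} is truncated or unterminated")
        records.append(buf[start:end])
        pos = end + 1
    return records

# The same five characters occupy different numbers of bytes
# depending on the encoding, which is why the encoding must be known.
text = "na\u00efve"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")

stream = (
    b"Len: " + str(len(utf8)).encode("ascii") + b"\n" + utf8 + b"\n"
    + b"Len: 3\n" + b"\x00\xff\n" + b"\n"   # binary data containing a newline
    + b"\n"
    + b"OtherTag: some value here\n"
)
records = parse_records(stream)
```

Here len(text) is 5, while len(utf8) is 6 and len(utf16) is 10, so a
Len: value written for one encoding is wrong for the same text stored
in another.  Because the parser slices by byte count instead of
scanning for a newline, the second record's data may itself contain
a newline byte.
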