Re: Perl 5's non-greedy matching can be TOO greedy!
More generally, it seems to me that you're hung up on the description of "*?" as "shortest possible match". That's an ambiguous

Yup, that's a bit confusing. It's really "start matching as soon as possible, and stop matching as soon as possible". (The usual greedy one is, of course, "keep matching as long as possible".) The initial invariant part, "start as soon as possible", is the de facto and de jure (at least POSIX 1003.2, but probably also Single Unix) definition, and therefore rather non-negotiable. It's like people who write /^.*fred/ instead of /.*fred/. They are forgetting something critical: where the Engine starts the search.

--tom
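A minimal sketch of the "start as soon as possible, stop as soon as possible" rule. The sample strings and variable names are made up for illustration; only the quantifiers come from the discussion.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only to illustrate the rule.
my $two_starts = "foo1 foo2 bar";
my $two_ends   = "foo1 bar1 bar2";

# Non-greedy does not hunt for the shortest substring overall: the match
# still begins at the leftmost "foo", so "foo2 bar" is never considered.
my ($lazy_start) = $two_starts =~ /(foo.*?bar)/;

# With the start fixed, *? stops at the first "bar" and * runs to the last.
my ($lazy_end)   = $two_ends =~ /(foo.*?bar)/;
my ($greedy_end) = $two_ends =~ /(foo.*bar)/;

print "lazy start: $lazy_start\n";   # foo1 foo2 bar
print "lazy end:   $lazy_end\n";     # foo1 bar
print "greedy end: $greedy_end\n";   # foo1 bar1 bar
```

The first result is the one that surprises people who read "*?" as "shortest possible match": a shorter candidate exists, but it starts later.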
Re: Perl 5's non-greedy matching can be TOO greedy!
Have you thought it through NOW, on a purely semantic level (in isolation from implementation issues and historical precedent), I've said it before, and I'll say it again: you keep using the word "semantic", but I do not think you know what that word means. --tom
Re: RFC 308 (v1) Ban Perl hooks into regexes
I consider recursive regexps very useful:

    $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) }x;

Yes, they're "useful", but darned tricky sometimes, and in ways other than simple regex-related stuff. For example, consider what happens if you do

    my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) }x;

That doesn't work due to differing scopings on either side of the assignment: the "my" does not take effect until the following statement, so the $regex inside the pattern is not the lexical being declared. And clearly a non-regex approach could be more legible for recursive parsing.

--tom
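One way around the scoping trap, sketched under the assumption that a fuller balanced-parentheses pattern is wanted: declare the lexical first and assign to it in a separate statement. The name $balanced and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Declare the lexical first, so that the $balanced named inside the
# deferred subpattern is the same variable that receives the pattern.
my $balanced;
$balanced = qr{
    \(                      # an opening paren
    (?:
        (?> [^()]+ )        # a run of non-parens, without backtracking
      |
        (??{ $balanced })   # or a nested balanced group
    )*
    \)                      # the matching closing paren
}x;

my @samples = ("(a(b)c)", "(a(b)c", "((()))");   # hypothetical inputs
for my $s (@samples) {
    printf "%-10s %s\n", $s,
        ($s =~ /\A$balanced\z/ ? "balanced" : "not balanced");
}
```

Splitting declaration from assignment is the whole trick; the pattern itself is just one plausible shape for the recursion.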
Re: \z vs \Z vs $
"TC" == Tom Christiansen [EMAIL PROTECTED] writes:

Could you explain what the problem is?

TC /$/ does not only match at the end of the string.
TC It also matches one character fewer. This makes
TC code like $path =~ /etc$/ "wrong".

Sorry, I'm missing it.

I know. On your "longest match", you are committing the classic error of thinking greed more important than eagerness. It's not.

This is unrelated to /m. Go back and read all the insanities we (mostly gbacon and yours truly) went through to fix the 5.6 release's modules. People coded them *WRONG*. Wrong means incorrect behaviour. Sometimes this even leads to security foo.

BOTTOM LINE: You cannot use /foo$/ to say "does the string end in `foo'?". You can't do that. You can't even use /s to fix it. It doesn't fix it. This is an annoying gotcha. Larry once said that he wished he had made \Z do what \z now does. One would like $ to (be able to) mean "ONLY AT END OF STRING".

--tom

EXAMPLE 1:

--- /usr/local/lib/perl5/5.00554/File/Basename.pm Mon Jan 4 13:00:53 1999
+++ /usr/local/lib/perl5/5.6.0/File/Basename.pm Sun Mar 12 22:24:29 2000
@@ -37,10 +37,10 @@
 "VMS", "MSDOS", "MacOS", "AmigaOS" or "MSWin32", the file specification
 syntax of that operating system is used in future calls to
 fileparse(), basename(), and dirname(). If it contains none of
-these substrings, UNIX syntax is used. This pattern matching is
+these substrings, Unix syntax is used. This pattern matching is
 case-insensitive. If you've selected VMS syntax, and the file
 specification you pass to one of these routines contains a "/",
-they assume you are using UNIX emulation and apply the UNIX syntax
+they assume you are using Unix emulation and apply the Unix syntax
 rules instead, for that function call only.

 If the argument passed to it contains one of the substrings "VMS",
@@ -73,7 +73,7 @@
 =head1 EXAMPLES

-Using UNIX file syntax:
+Using Unix file syntax:

 ($base,$path,$type) = fileparse('/virgil/aeneid/draft.book7',
                                 '\.book\d+');
@@ -102,7 +102,7 @@
 The basename() routine returns the first element of the list produced
 by calling fileparse() with the same arguments, except that it always
 quotes metacharacters in the given suffixes. It is provided for
-programmer compatibility with the UNIX shell command basename(1).
+programmer compatibility with the Unix shell command basename(1).

 =item C<dirname>

@@ -111,8 +111,8 @@
 second element of the list produced by calling fileparse() with the same
 input file specification. (Under VMS, if there is no directory information
 in the input file specification, then the current default device and
-directory are returned.) When using UNIX or MSDOS syntax, the return
-value conforms to the behavior of the UNIX shell command dirname(1). This
+directory are returned.) When using Unix or MSDOS syntax, the return
+value conforms to the behavior of the Unix shell command dirname(1). This
 is usually the same as the behavior of fileparse(), but differs in some
 cases. For example, for the input file specification F<lib/>, fileparse()
 considers the directory name to be F<lib/>, while dirname() considers the
@@ -124,12 +124,22 @@
 ## use strict;
-use re 'taint';
+# A bit of juggling to insure that C<use re 'taint';> always works, since
+# File::Basename is used during the Perl build, when the re extension may
+# not be available.
+BEGIN {
+  unless (eval { require re; })
+    { eval ' sub re::import { $^H |= 0x0010; } ' }
+  import re 'taint';
+}
+
+
+use 5.005_64;
+our(@ISA, @EXPORT, $VERSION, $Fileparse_fstype, $Fileparse_igncase);
 require Exporter;
 @ISA = qw(Exporter);
 @EXPORT = qw(fileparse fileparse_set_fstype basename dirname);
-use vars qw($VERSION $Fileparse_fstype $Fileparse_igncase);
 $VERSION = "2.6";
@@ -162,23 +172,23 @@
   if ($fstype =~ /^VMS/i) {
     if ($fullname =~ m#/#) { $fstype = '' } # We're doing Unix emulation
     else {
-      ($dirpath,$basename) = ($fullname =~ /^(.*[:\]])?(.*)/);
+      ($dirpath,$basename) = ($fullname =~ /^(.*[:\]])?(.*)/s);
       $dirpath ||= ''; # should always be defined
     }
   }
   if ($fstype =~ /^MS(DOS|Win32)/i) {
-    ($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/);
-    $dirpath .= '.\\' unless $dirpath =~ /[\\\/]$/;
+    ($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/s);
+    $dirpath .= '.\\' unless $dirpath =~ /[\\\/]\z/;
   }
-  elsif ($fstype =~ /^MacOS/i) {
-    ($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/);
+  elsif ($fstype =~ /^MacOS/si) {
+    ($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/s);
   }
   elsif ($fstype =~ /^AmigaOS/i) {
-    ($dirpath,$basename) = ($fullname =~ /(.*[:\/])?(.*)/);
+    ($dirpath,$basename) = ($fullname =~ /(.*[:\/])?(.*)/s);
     $dirpath = './' unless $dirpath;
   }
   e
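The bottom line of this message can be checked directly. This is only a sketch with invented path strings; it shows that /etc$/ accepts a value with a trailing newline, that /s does not change that, and that \z does.

```perl
use strict;
use warnings;

# Hypothetical inputs: the second one carries a trailing newline.
my @paths = ("/usr/etc", "/usr/etc\n");

for my $path (@paths) {
    (my $shown = $path) =~ s/\n/\\n/g;       # make the newline visible
    my $dollar   = $path =~ /etc$/  ? "yes" : "no";
    my $dollar_s = $path =~ /etc$/s ? "yes" : "no";
    my $small_z  = $path =~ /etc\z/ ? "yes" : "no";
    printf "%-12s dollar=%s dollar+s=%s z=%s\n",
        $shown, $dollar, $dollar_s, $small_z;
}
```

Only the \z column distinguishes the two inputs, which is why the 5.6 module fixes quoted in the diff swap $ for \z where a true end-of-string test was meant.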
Re: \z vs \Z vs $
That was my second thought. I kinda like it, because //s would have two effects:

+ let . match a newline too (current)
+ let /$/ NOT accept a trailing newline (new)

Don't forget /s's other meaning.

--tom
\z vs \Z vs $
What can be done to make $ work "better", so we don't have to make people use /foo\z/ to mean /foo$/? They'll keep writing the $ for things that probably oughtn't abide optional newlines. Remember that /$/ really means /(?=\n?\z)/. And likewise with \Z. --tom
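The equivalence stated here can be spot-checked. A small sketch, assuming nothing beyond the three constructs named in the message; the test strings are invented, and \z is included only for contrast.

```perl
use strict;
use warnings;

# Hypothetical test strings covering the interesting endings.
my @cases = ("foo", "foo\n", "foo\n\n", "foobar");

my $all_agree = 1;
for my $s (@cases) {
    my $dollar    = $s =~ /foo$/         ? 1 : 0;
    my $big_z     = $s =~ /foo\Z/        ? 1 : 0;
    my $lookahead = $s =~ /foo(?=\n?\z)/ ? 1 : 0;
    my $small_z   = $s =~ /foo\z/        ? 1 : 0;   # strict end of string

    $all_agree = 0 unless $dollar == $big_z && $big_z == $lookahead;

    (my $shown = $s) =~ s/\n/\\n/g;                  # make newlines visible
    printf "%-10s dollar=%d Z=%d lookahead=%d z=%d\n",
        $shown, $dollar, $big_z, $lookahead, $small_z;
}
print $all_agree ? "dollar, \\Z and the lookahead agree on every case\n"
                 : "disagreement found\n";
```

A handful of strings is not a proof, but it does show where \z parts company with the other three: the single trailing newline.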
Re: XML/HTML-specific ? and ? operators? (was Re: RFC 145 (alternate approach))
I am working on an RFC to allow boolean logic (&& and || and !) to apply a number of patterns to the same substring to allow easier mining of information out of such constructs.

What, you don't like: :-)

    $pattern = $conjunction eq "AND"
        ? join('' => map { "(?=.*$_)" } @patterns)
        : join("|" => @patterns);

--tom
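For anyone who wants to try the idea, here is a runnable sketch. The sub name combine_patterns and the sample data are hypothetical; the lookahead-per-pattern trick for AND and plain alternation for OR are what the message describes.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the expression from the message.
sub combine_patterns {
    my ($conjunction, @patterns) = @_;
    return $conjunction eq "AND"
        ? join('' => map { "(?=.*$_)" } @patterns)   # each must match somewhere
        : join("|" => @patterns);                    # any one may match
}

my @patterns = ("foo", "bar");                       # sample subpatterns
my $and = combine_patterns("AND", @patterns);        # (?=.*foo)(?=.*bar)
my $or  = combine_patterns("OR",  @patterns);        # foo|bar

for my $text ("bar before foo", "only foo here") {
    printf "%-16s AND=%s OR=%s\n", $text,
        ($text =~ /$and/s ? "yes" : "no"),           # /s lets .* cross newlines
        ($text =~ /$or/   ? "yes" : "no");
}
```

Because each lookahead starts from the same position, the order in which the subpatterns occur in the text does not matter for the AND case.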
Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)
Uri Guttman wrote:

TC ($this = $that) =~ s/foo/bar/;
TC for (@these = @those) { s/foo/bar/ }
TC You can't really do those in one step without it.

RFC 164 v2 has a new syntax that lets you do the above or, if you want:

    $this = s/foo/bar/, $that;
    @these = s/foo/bar/, @those;

Those really aren't any more obvious to the reader than what we already have. Less so, in fact, since you can understand what the current ones are doing based on simple operators and precedences.

--tom
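A short sketch of the two existing idioms quoted at the top of this message, with invented sample data, showing that the copy is edited and the source is left alone.

```perl
use strict;
use warnings;

my $that  = "foo fighters";            # sample data, not from the thread
my @those = ("foo", "food", "bar");

# Copy, then substitute in the copy: the assignment is itself an lvalue.
(my $this = $that) =~ s/foo/bar/;

# Same idea for arrays: the loop aliases the elements of the new array.
my @these;
for (@these = @those) { s/foo/bar/ }

print "$that -> $this\n";              # foo fighters -> bar fighters
print "@those -> @these\n";            # foo food bar -> bar bard bar
```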
Re: RFC 110 (v3) counting matches
That empty list to force the proper context irks me. How about a modifier to the RE that forces it (this would solve the "counting matches" problem too).

    $string =~ m{ (\d\d) - (\d\d) - (\d\d) (?{ push @dates, makedate($1,$2,$3) }) }gxl;
    $count = $string =~ m/foo/gl; # always list context

The reason why not is because you're adding a special case hack to one particular place, rather than promoting a general mechanism that can be everywhere.

Tell me: which is better and why?

1) A regex switch to specify scalar context, as in a mythical /r:

    push(@got, /bar/r)

2) A general mechanism, say for example, "scalar":

    push(@got, scalar /bar/)

Obviously the "scalar" is better, because it does not require that a new switch be learnt, nor is its use restricted to pattern matching. Furthermore, it's inarguably more mnemonic for the sense of "match this scalarishly".

Likewise, to force list context (a far less common operation, mind you), it is a bad idea to have what amounts to a special argument to just one function to do this. What happens to the next function you want to do this to? How about if I want to force getpwnam() into list context and get back a scalar result?

    $count = getpwnam("tchrist")/l;
    $count = getpwnam("tchrist", LIST);
    $count = getpwnam("tchrist")->as_list;

All of those, frankly, suck. This is much better:

    $count = () = getpwnam("tchrist");

It's better because

* You don't have to invent anything new, whether syntactically or mnemonically. The sucky solutions all require modification of Perl's very syntax. With the list assignment, you just need to learn how to use what you *already have*. I could say as much for (?{...}). Think how many of the suggestions on these lists can be dealt with simply through using existing features that the suggesting party was unaware of.

* It's a general mechanism that isn't tailored for this particular function call. Special-purpose solutions are often inferior to general-purpose ones, because the latter are more likely to be creatively usable in a fashion unforeseen by the author.

* What could possibly be more intuitive for the action of acting as though one were assigning to a list than doing that very thing itself? Since () is the canonical list (it's empty, after all), this follows directly and requires no special knowledge whatsoever.

--tom
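To see both general mechanisms at work without depending on a password database, here is a sketch using a made-up function, fields(), that stands in for getpwnam() and reports its calling context through wantarray.

```perl
use strict;
use warnings;

# Hypothetical stand-in for getpwnam(): a list in list context,
# a single string in scalar context.
sub fields {
    return wantarray ? qw(name passwd uid gid) : "called in scalar context";
}

my $count  = () = fields();      # list context forced, elements counted
my $forced = scalar fields();    # scalar context forced with "scalar"

my @got;
push(@got, scalar fields());     # works inside a list, too

print "count:  $count\n";        # 4
print "forced: $forced\n";       # called in scalar context
print "pushed: $got[0]\n";       # called in scalar context
```

Neither line needs a new switch or method: list assignment in scalar context yields the number of elements on its right-hand side, and "scalar" is an ordinary operator.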
Re: RFC 110 (v2) counting matches
If we want to use uppercase, make these unique as well. That gives us many more combinations, and is not necessarily confusing:

    m//f - fast match
    m//F - first match
    m//i - case-insensitive
    m//I - ignore whitespace

And so on. This seems like a much more productive use, otherwise we're just wasting characters.

Larry's on record as preferring not to have us going down the road of using distinct upper and lower case regex switches. The distance between //c and //C, say, is far too narrow.

--tom
Re: RFC 165 (v1) Allow Varibles in tr///
tr///e is the same as s///g:

    tr/$foo/$bar/e == s/$foo/$bar/g

I suggest you read up on tr///, sir. You are completely wrong.

--tom
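A small illustration of why the two operators are not interchangeable, using literal lists rather than the proposed /e switch (which does not exist in Perl 5). The sample string is invented.

```perl
use strict;
use warnings;

my $original = "a cab";                           # sample text

# tr/// maps characters one for one: a->x, b->y, c->z.
(my $translated = $original) =~ tr/abc/xyz/;

# s///g replaces occurrences of the substring "abc"; there are none here.
(my $substituted = $original) =~ s/abc/xyz/g;

print "tr: $translated\n";                        # x zxy
print "s:  $substituted\n";                       # a cab
```

tr/// works on a character table built at compile time, which is also why it does not interpolate variables, while s///g matches a pattern.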
Re: RFC 110 (v3) counting matches
p.s. Has anybody already suggested that we ought to have a nicer solution to execute perl code inside a string, replacing "${\(...)}" and "@{[...]}", which also won't ever win a beauty contest? Oops, wrong mailing list.

The first one doesn't work, and never did. You want @{[]} and @{[scalar ]} instead.

"Doesn't work"?

    print "The sum of 1 + 2 is ${\(1+2)}.\n";
    --
    The sum of 1 + 2 is 3.

I'm surprised you wouldn't have known this. The principle is the same: "${...}" expects a scalar reference inside the block, and '\' provides one. Of course, there shouldn't be a real multi-element list inside the parens, but just one scalar. And often, the parens aren't needed.

I'm surprised that you still don't understand. Notice what I showed you for the replacement above: @{[scalar ]}. Using ${\(...)} doesn't work in the sense that, contrary to popular belief, it fails to provide a scalar context to the contents of those parens. Thus ${ \( fn() ) } is still calling fn() in list context, not scalar context. Witness:

    sub fn { sprintf "called in %s context", wantarray ? "list" : "scalar" }
    print "Test 1: "; print "@{ [fn()] }\n";
    print "Test 2: "; print "${ \(fn()) }\n";
    print "Test 3: "; print "@{ [scalar fn()] }\n";

That, when executed, yields:

    Test 1: called in list context
    Test 2: called in list context
    Test 3: called in scalar context

*That's* why test 2 "doesn't work".

--tom
Re: Overlapping RFCs 135 138 164
($foo = $bar) =~ s/x/y/; will never make much sense to me.

What about these, which are much the same thing in that they all use the lvaluability of assignment:

    chomp($line = <STDIN>);
    ($foo = $bar) += 10;
    ($foo += 3) *= 2;
    func($diddle_me = $protect_me);
    $n = select($rout=$rin, $wout=$win, $eout=$ein, 2.5);

--tom
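A self-contained sketch of three of these forms, with invented values and a string literal standing in for a line read from STDIN, so the results can be checked by eye.

```perl
use strict;
use warnings;

my $bar = 5;                              # sample values, not from the thread

# The assignment yields $foo as an lvalue, so += applies to $foo, not $bar.
(my $foo = $bar) += 10;

# The same holds for compound assignment: add 3 first, then double.
my $n = 1;
($n += 3) *= 2;

# chomp() trims the variable just assigned; a literal stands in for input.
chomp(my $line = "typed text\n");

print "foo=$foo bar=$bar\n";              # foo=15 bar=5
print "n=$n\n";                           # n=8
print "line=[$line]\n";                   # line=[typed text]
```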
Re: Overlapping RFCs 135 138 164
What about these, which are much the same thing in that they all use the lvaluability of assignment:

And don't forget:

    for (@new = @old) { s/foo/bar/ }

--tom
Re: RFC 110 (v3) counting matches
Have you ever wanted to count the number of matches of a pattern? s///g returns the number of matches it finds. m//g just returns 1 for matching. Counts can be made using s//$&/g but this is wasteful, or by putting some counting loop round a m//g. But this all seems rather messy.

It's really much easier than all that:

    $count = () = $string =~ /pattern/g;

--tom
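A quick check of the one-liner against the counting loop it replaces. The sample string is invented; both approaches should report the same number.

```perl
use strict;
use warnings;

my $string = "foo bar foo baz foo";       # sample text

# List assignment in scalar context yields the number of matches found.
my $count = () = $string =~ /foo/g;

# The "counting loop round a m//g" alternative, for comparison.
my $loop_count = 0;
$loop_count++ while $string =~ /foo/g;

print "list assignment: $count\n";        # 3
print "counting loop:   $loop_count\n";   # 3
```

Unlike the s///g approach, neither form modifies the string.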
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
Simple solution. If you want to require formats such as m/.../ (which I actually think is a good idea), then make it part of -w, -W, -ww, or -WW, which would be a perl6 enhancement of strictness. That's like having "use strict" enable mandatory perlstyle compliance checks, and rejecting the program otherwise. Doesn't seem sensible. --tom
Re: RFC 158 (v1) Regular Expression Special Variables
those early perl3 scripts by lwall floating around in /etc were poorly written. i am glad they are finally out of the distribution. Those weren't the scripts I was thinking about, and it is *NOT* ipso facto true that something which uses $& or $` is poorly written. --tom
Re: RFC 145 (v2) Brace-matching for Perl Regular Expressions
All in all, though, you're right that neither set of features is particularly well-known/used outside of p5p followers. At least from what I've seen. Virtually every person I've worked with since 5.6 came out has been surprised and amazed at the REx eval stuff. The completely reworked regex chapter in Camel III explains and demos all the new 5.6 features. I do not believe they will long remain the Cabal's secret. --tom
Re: RFC 144 (v1) Behavior of empty regex should be simple
I propose that this 'last successful match' behavior be discarded entirely, and that an empty pattern always match the empty string. I don't see a consideration for simply s/successful// above, which has also been talked about. That would also match expected usage based upon existing editors. --tom
Re: RFC 150 (v1) Extend regex syntax to provide for return of a hash of matched subpatterns
This is useful in that it would stop being number dependent. For example, you can't now safely say /$var (foo) \1/ and guarantee for arbitrary contents of $var that you have the right backref number anymore.

If I recall correctly, the Python folks addressed this. One might check that.

--tom
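The numbering problem is easy to reproduce. The sketch below uses an invented value for $var, and for contrast uses named captures, (?<name>...) with \k<name>, which are available in Perl 5.10 and later (they did not exist when this thread was written).

```perl
use strict;
use warnings;

my $var = "(x)";   # hypothetical interpolated text that brings its own group

# Numbered: the group inside $var becomes group 1, so \1 means "x".
my $numbered = qr/${var}-(foo)-\1/;

# Named: the backreference follows the group whatever its number is.
my $named = qr/${var}-(?<word>foo)-\k<word>/;

for my $text ("x-foo-foo", "x-foo-x") {
    printf "%-10s numbered=%s named=%s\n", $text,
        ($text =~ $numbered ? "yes" : "no"),
        ($text =~ $named    ? "yes" : "no");
}
```

The numbered pattern matches "x-foo-x" and rejects "x-foo-foo", the opposite of what its author meant; the named one is unaffected by what $var contains.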