Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Tom Christiansen

 More generally, it seems to me that you're hung up on the description 
 of "*?" as "shortest possible match".  That's an ambiguous 

Yup, that's a bit confusing.  It's really "start matching as soon as
possible, and stop matching as soon as possible".  (The usual greedy
one is, of course, "keep matching as long as possible".)  The initial
invariant part, "start as soon as possible", is the de facto and de
jure (at least POSIX 1003.2, but probably also Single Unix)
definition, and therefore rather non-negotiable.

It's like people who write /^.*fred/ instead of /.*fred/.  They
are forgetting something critical: where the Engine starts the search.
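
A minimal sketch of the "start as soon as possible" rule, using an
invented sample string.  Both quantifiers begin at the first place a
match can succeed; they differ only in where they stop:

```perl
use strict;
use warnings;

my $text = "xAyyAzBqB";    # invented sample input

# Greedy: starts at the first A, then runs to the LAST B.
my ($greedy) = $text =~ /(A.*B)/;

# Non-greedy: still starts at the first A, but stops at the FIRST B after it.
my ($lazy) = $text =~ /(A.*?B)/;

# The shortest substring of the form A...B is "AzB", but the engine never
# reports it, because a match was already possible from the earlier A.
print "greedy: $greedy\n";    # greedy: AyyAzBqB
print "lazy:   $lazy\n";      # lazy:   AyyAzB
```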

--tom



Re: Perl 5's non-greedy matching can be TOO greedy!

2000-12-15 Thread Tom Christiansen

Have you thought it through NOW, on a purely semantic level (in isolation
from implementation issues and historical precedent), 

I've said it before, and I'll say it again: you keep using 
the word "semantic", but I do not think you know what that word means.

--tom



Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-28 Thread Tom Christiansen

I consider recursive regexps very useful:

 $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };

Yes, they're "useful", but darned tricky sometimes, and in
ways other than simple regex-related stuff.  For example,
consider what happens if you do

my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };

That doesn't work due to differing scopings on either side
of the assignment.  And clearly a non-regex approach could
be more legible for recursive parsing.
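
One way around the scoping problem, sketched here under the assumption
that a separate declaration is acceptable: declare the lexical first
and assign to it afterwards, so the name inside (??{ }) refers to the
same variable.  The name $balanced and the pattern layout (modelled on
the balanced-parentheses example in perlre) are illustrative only:

```perl
use strict;
use warnings;

# Declare first, assign second: with "my $re = qr{... $re ...}" the $re on
# the right-hand side would not yet refer to the new lexical.
my $balanced;
$balanced = qr{
    \(
    (?:
        (?> [^()]+ )          # a run of non-parentheses, never backtracked into
      |
        (??{ $balanced })     # or a nested balanced group
    )*
    \)
}x;

my $ok  = "(a(b)c)" =~ /^$balanced\z/ ? 1 : 0;
my $bad = "(a(b)c"  =~ /^$balanced\z/ ? 1 : 0;
print "balanced: $ok, unbalanced: $bad\n";
```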

--tom




Re: \z vs \Z vs $

2000-09-20 Thread Tom Christiansen

 "TC" == Tom Christiansen [EMAIL PROTECTED] writes:

 Could you explain what the problem is?

TC /$/ does not only match at the end of the string.
TC It also matches one character fewer.  This makes
TC code like $path =~ /etc$/ "wrong".

Sorry, I'm missing it.

I know.  

On your "longest match", you are committing the classic error of thinking
greed more important than eagerness.  It's not.

This is unrelated to /m.

Go back and read all the insanities we (mostly gbacon and yours
truly) went through to fix the 5.6 release's modules.  People coded
them *WRONG*.  Wrong means incorrect behaviour.  Sometimes this
even leads to security foo.

BOTTOM LINE: You cannot use /foo$/ to say "does the string end in `foo'?".
You can't do that.  You can't even use /s to fix it.  It doesn't fix it.
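
A small sketch of the gotcha, using an invented value that carries a
trailing newline:

```perl
use strict;
use warnings;

my $path = "/etc\n";    # e.g. a line read from input and never chomped

my $dollar   = $path =~ /etc$/  ? 1 : 0;   # $ also matches before a final newline
my $dollar_s = $path =~ /etc$/s ? 1 : 0;   # /s only changes what "." matches
my $strict   = $path =~ /etc\z/ ? 1 : 0;   # \z matches only at the very end

print "\$: $dollar  \$ with /s: $dollar_s  \\z: $strict\n";
```

Only the \z form rejects the string with the trailing newline.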

This is an annoying gotcha.  Larry once said that he wished he had made  \Z
do what \z now does.  One would like $ to (be able to) mean "ONLY AT END OF
STRING".

--tom

EXAMPLE 1:

--- /usr/local/lib/perl5/5.00554/File/Basename.pm   Mon Jan  4 13:00:53 1999
+++ /usr/local/lib/perl5/5.6.0/File/Basename.pm Sun Mar 12 22:24:29 2000
@@ -37,10 +37,10 @@
 "VMS", "MSDOS", "MacOS", "AmigaOS" or "MSWin32", the file specification 
 syntax of that operating system is used in future calls to 
 fileparse(), basename(), and dirname().  If it contains none of
-these substrings, UNIX syntax is used.  This pattern matching is
+these substrings, Unix syntax is used.  This pattern matching is
 case-insensitive.  If you've selected VMS syntax, and the file
 specification you pass to one of these routines contains a "/",
-they assume you are using UNIX emulation and apply the UNIX syntax
+they assume you are using Unix emulation and apply the Unix syntax
 rules instead, for that function call only.
 
 If the argument passed to it contains one of the substrings "VMS",
@@ -73,7 +73,7 @@
 
 =head1 EXAMPLES
 
-Using UNIX file syntax:
+Using Unix file syntax:
 
 ($base,$path,$type) = fileparse('/virgil/aeneid/draft.book7',
'\.book\d+');
@@ -102,7 +102,7 @@
 The basename() routine returns the first element of the list produced
 by calling fileparse() with the same arguments, except that it always
 quotes metacharacters in the given suffixes.  It is provided for
-programmer compatibility with the UNIX shell command basename(1).
+programmer compatibility with the Unix shell command basename(1).
 
 =item C<dirname>
 
@@ -111,8 +111,8 @@
 second element of the list produced by calling fileparse() with the same
 input file specification.  (Under VMS, if there is no directory information
 in the input file specification, then the current default device and
-directory are returned.)  When using UNIX or MSDOS syntax, the return
-value conforms to the behavior of the UNIX shell command dirname(1).  This
+directory are returned.)  When using Unix or MSDOS syntax, the return
+value conforms to the behavior of the Unix shell command dirname(1).  This
 is usually the same as the behavior of fileparse(), but differs in some
 cases.  For example, for the input file specification F<lib/>, fileparse()
 considers the directory name to be F<lib/>, while dirname() considers the
@@ -124,12 +124,22 @@
 
 
 ## use strict;
-use re 'taint';
+# A bit of juggling to insure that C<use re 'taint';> always works, since
+# File::Basename is used during the Perl build, when the re extension may
+# not be available.
+BEGIN {
+  unless (eval { require re; })
+{ eval ' sub re::import { $^H |= 0x0010; } ' }
+  import re 'taint';
+}
+
+
 
+use 5.005_64;
+our(@ISA, @EXPORT, $VERSION, $Fileparse_fstype, $Fileparse_igncase);
 require Exporter;
 @ISA = qw(Exporter);
 @EXPORT = qw(fileparse fileparse_set_fstype basename dirname);
-use vars qw($VERSION $Fileparse_fstype $Fileparse_igncase);
 $VERSION = "2.6";
 
 
@@ -162,23 +172,23 @@
   if ($fstype =~ /^VMS/i) {
 if ($fullname =~ m#/#) { $fstype = '' }  # We're doing Unix emulation
 else {
-  ($dirpath,$basename) = ($fullname =~ /^(.*[:\]])?(.*)/);
+  ($dirpath,$basename) = ($fullname =~ /^(.*[:\]])?(.*)/s);
   $dirpath ||= '';  # should always be defined
 }
   }
   if ($fstype =~ /^MS(DOS|Win32)/i) {
-($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/);
-$dirpath .= '.\\' unless $dirpath =~ /[\\\/]$/;
+($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/s);
+$dirpath .= '.\\' unless $dirpath =~ /[\\\/]\z/;
   }
-  elsif ($fstype =~ /^MacOS/i) {
-($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/);
+  elsif ($fstype =~ /^MacOS/si) {
+($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/s);
   }
   elsif ($fstype =~ /^AmigaOS/i) {
-($dirpath,$basename) = ($fullname =~ /(.*[:\/])?(.*)/);
+($dirpath,$basename) = ($fullname =~ /(.*[:\/])?(.*)/s);
 $dirpath = './' unless $dirpath;
   }
   e

Re: \z vs \Z vs $

2000-09-20 Thread Tom Christiansen

That was my second thought. I kinda like it, because //s would have two
effects:

 + let . match a newline too (current)

 + let /$/ NOT accept a trailing newline (new)

Don't forget /s's other meaning.

--tom



\z vs \Z vs $

2000-09-19 Thread Tom Christiansen

What can be done to make $ work "better", so we don't have to
make people use /foo\z/ to mean /foo$/?  They'll keep writing
the $ for things that probably oughtn't abide optional newlines.

Remember that /$/ really means /(?=\n?\z)/. And likewise with \Z.
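
The equivalence can be checked directly.  This is only a sketch over a
handful of invented test strings, not a proof:

```perl
use strict;
use warnings;

my @rows;
for my $s ("foo", "foo\n", "foo\n\n", "foo\nbar") {
    my $dollar    = $s =~ /foo$/         ? 1 : 0;
    my $big_z     = $s =~ /foo\Z/        ? 1 : 0;
    my $lookahead = $s =~ /foo(?=\n?\z)/ ? 1 : 0;
    my $small_z   = $s =~ /foo\z/        ? 1 : 0;
    push @rows, [$dollar, $big_z, $lookahead, $small_z];

    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible in the report
    print "$shown: \$=$dollar \\Z=$big_z lookahead=$lookahead \\z=$small_z\n";
}
```

The first three columns agree on every string; \z parts company with
them only on "foo\n".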

--tom



Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Tom Christiansen

I am working on an RFC
to allow boolean logic (&& and || and !) to apply a number of patterns to
the same substring to allow easier mining of information out of such
constructs. 

What, you don't like: :-)

$pattern = $conjunction eq "AND"
    ? join(''  => map { "(?=.*$_)" } @patterns)
    : join("|" => @patterns);
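
Wrapped in a helper, the same idea can be exercised on a few strings.
The sub name combine() and the sample inputs are invented for this
sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: build one pattern from several sub-patterns.
sub combine {
    my ($conjunction, @patterns) = @_;
    return $conjunction eq "AND"
        ? join(''  => map { "(?=.*$_)" } @patterns)   # every lookahead must succeed
        : join('|' => @patterns);                     # any one alternative may succeed
}

my $all = combine(AND => 'foo', 'bar');
my $any = combine(OR  => 'foo', 'bar');

for my $s ("bar and foo", "only foo", "neither") {
    printf "%-12s AND=%d OR=%d\n", $s,
        ($s =~ /$all/ ? 1 : 0),
        ($s =~ /$any/ ? 1 : 0);
}
```

Because "." does not match a newline by default, the AND form as
written only looks within a single line; adding /s at the point of use
would lift that limit.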

--tom



Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)

2000-08-30 Thread Tom Christiansen

Uri Guttman wrote:
 
   TC ($this = $that) =~ s/foo/bar/;
   TC for (@these = @those) { s/foo/bar/ }
 
   TC You can't really do those in one step without it.

RFC 164 v2 has a new syntax that lets you do the above or, if you want:

   $this = s/foo/bar/, $that;
   @these = s/foo/bar/, @those;

Those really aren't any more obvious to the reader than what we
already have.  Less so, in fact, since you can understand what the
current ones are doing based on simple operators and precedences.
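
For reference, a sketch of what the existing idioms do, with invented
data.  In both cases the originals are left untouched:

```perl
use strict;
use warnings;

my $that  = "foo fighters";
my @those = ("food", "fool");

# Copy first, then edit the copy: the assignment yields its left side
# as an lvalue, so s/// is applied to $this.
(my $this = $that) =~ s/foo/bar/;

# Same idea for a list: the loop aliases the elements of @these, not @those.
my @these;
for (@these = @those) { s/foo/bar/ }

print "$this / $that\n";      # bar fighters / foo fighters
print "@these / @those\n";    # bard barl / food fool
```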

--tom



Re: RFC 110 (v3) counting matches

2000-08-29 Thread Tom Christiansen

That empty list to force the proper context irks me.  How about a
modifier to the RE that forces it (this would solve the "counting matches"
problem too).

   $string =~ m{
   (\d\d) - (\d\d) - (\d\d)
   (?{ push @dates, makedate($1,$2,$3) })
   }gxl;

   $count = $string =~ m/foo/gl;   # always list context

The reason why not is because you're adding a special case hack to 
one particular place, rather than promoting a general mechanism
that can be everywhere.  

Tell me: which is better and why.

1) A regex switch to specify scalar context, as in a mythical /r:

push(@got, /bar/r)

2) A general mechanism, say for example, "scalar":

push(@got, scalar /bar/)

Obviously the "scalar" is better, because it does not require that
a new switch be learnt, nor is its use restricted to pattern matching.
Furthermore, it's inarguably more mnemonic for the sense of "match this
scalarishly".

Likewise, to force list context (a far less common operation, mind
you), it is a bad idea to have what amounts to a special argument
to just one function to do this.  What happens to the next function you
want to do this to?  How about if I want to force getpwnam() into list
context and get back a scalar result?

$count = getpwnam("tchrist")/l;
$count = getpwnam("tchrist", LIST);
$count = getpwnam("tchrist")->as_list;

All of those, frankly, suck.  This is much better:

$count = () = getpwnam("tchrist");

It's better because 

  * You don't have to invent anything new, whether syntactically
or mnemonically.  The sucky solutions all require modification
of Perl's very syntax.  With the list assignment, you just need
to learn how to use what you *already have*.  I could say as
much for (?{...}).  Think how many of the suggestions on these
lists can be dealt with simply through using existing features
that the suggesting party was unaware of.

  * It's a general mechanism that isn't tailored for this particular
function call.  Special-purpose solutions are often inferior
to general-purpose ones, because the latter are more likely to 
be creatively usable in a fashion unforeseen by the author.

  * What could possibly be more intuitive for the action of acting
as though one were assigning to a list than doing that very
thing itself?  Since () is the canonical list (it's empty, after
all), this follows directly and requires no special knowledge
whatsoever.
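
Both general mechanisms can be seen at work in a few lines.  The subs
context() and triple() are invented stand-ins for any function whose
behaviour depends on context:

```perl
use strict;
use warnings;

sub context { return wantarray ? "list" : "scalar" }
sub triple  { return (10, 20, 30) }

my @got;
push @got, context();           # push supplies list context
push @got, scalar context();    # "scalar" overrides it

# Assigning to the empty list forces list context on the right; the list
# assignment itself, evaluated in scalar context, yields the number of
# elements produced on the right-hand side.
my $count = () = triple();

print "@got, count = $count\n";    # list scalar, count = 3
```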

--tom



Re: RFC 110 (v2) counting matches

2000-08-29 Thread Tom Christiansen

If we want to use uppercase, make these unique as well. That gives us
many more combinations, and is not necessarily confusing:

   m//f  -  fast match
   m//F  -  first match
   m//i  -  case-insensitive
   m//I  -  ignore whitespace
   
And so on. This seems like a much more productive use, otherwise we're
just wasting characters.

Larry's on record as preferring not to have us going down the road
of using distinct upper and lower case regex switches.  The distance
between //c and //C, say, is far too narrow.

--tom



Re: RFC 165 (v1) Allow Varibles in tr///

2000-08-29 Thread Tom Christiansen

tr///e is the same as s///g:

tr/$foo/$bar/e  ==  s/$foo/$bar/g

I suggest you read up on tr///, sir.  You are completely wrong.
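
The difference shows up on almost any input.  A short sketch with
invented strings: tr/// maps characters one for one, s///g replaces
substrings, and tr/// does not interpolate variables (perlop documents
a string eval as the workaround):

```perl
use strict;
use warnings;

(my $translit = "hello") =~ tr/el/ip/;    # maps e->i and l->p, character by character
(my $subst    = "hello") =~ s/el/ip/g;    # replaces the substring "el" with "ip"

print "$translit $subst\n";    # hippo hiplo

# The translation table is built at compile time, so variable ranges
# need a string eval.
my ($from, $to) = ("a-c", "A-C");
(my $literal = "abcdef") =~ tr/a-c/A-C/;           # ranges written literally
my $dynamic = "abcdef";
eval "\$dynamic =~ tr/$from/$to/; 1" or die $@;    # ranges taken from variables

print "$literal $dynamic\n";
```

The eval form should only ever be fed trusted values, since it compiles
whatever $from and $to contain.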

--tom



Re: RFC 110 (v3) counting matches

2000-08-29 Thread Tom Christiansen

p.s. Has anybody already suggested that we ought to have a nicer
solution to execute perl code inside a string, replacing "${\(...)}" and
"@{[...]}", which also won't ever win a beauty contest?  Oops, wrong
mailing list.

The first one doesn't work, and never did.  You want 
@{[]} and @{[scalar ]} instead.

"Doesn't work"?

   print "The sum of 1 + 2 is ${\(1+2)}.\n";
--
   The sum of 1 + 2 is 3.

I'm surprised you wouldn't have known this. The principle is the same:
"${...}" expects a scalar reference inside the block, and '\' provides
one. Of course, there shouldn't be a real multi-element list inside the
parens, but just one scalar. And often, the parens aren't needed.

I'm surprised that you still don't understand.  Notice what I showed
you for the replacement above: @{[scalar ]}.

Using ${\(...)} doesn't work in the sense that contrary to popular
belief, it fails to provide a scalar context to the contents of
those parens.  Thus ${ \( fn() ) } is still calling fn() in list
context, not scalar context.  Witness:

sub fn { sprintf "called in %s context", wantarray ? "list" : "scalar" } 

print "Test 1: ";
print "@{ [fn()] }\n";

print "Test 2: ";
print "${ \(fn()) }\n";

print "Test 3: ";
print "@{ [scalar fn()] }\n";

That, when executed, yields:

Test 1: called in list context
Test 2: called in list context
Test 3: called in scalar context

*That's* why test 2 "doesn't work".

--tom



Re: Overlapping RFCs 135 138 164

2000-08-29 Thread Tom Christiansen

($foo = $bar) =~ s/x/y/; will never make much sense to me. 

What about these, which are much the same thing in that they all
use the lvaluability of assignment:

chomp($line = <STDIN>);
($foo = $bar) += 10;
($foo += 3) *= 2;
func($diddle_me = $protect_me);
$n = select($rout=$rin, $wout=$win, $eout=$ein, 2.5);
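
A quick sketch of what the arithmetic forms leave behind, with invented
values and a fixed string standing in for the line read from STDIN:

```perl
use strict;
use warnings;

my $bar = 5;
my $foo;

($foo = $bar) += 10;    # copy $bar into $foo, then add to $foo
my $after_add = $foo;

($foo += 3) *= 2;       # add 3 to $foo, then double $foo

chomp(my $line = "a line of input\n");    # stand-in for reading from <STDIN>

print "$after_add $foo $bar '$line'\n";   # 15 36 5 'a line of input'
```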

--tom



Re: Overlapping RFCs 135 138 164

2000-08-29 Thread Tom Christiansen

What about these, which are much the same thing in that they all
use the lvaluability of assignment:

And don't forget:

for (@new = @old) { s/foo/bar/ } 

--tom



Re: RFC 110 (v3) counting matches

2000-08-28 Thread Tom Christiansen

Have you ever wanted to count the number of matches of a pattern?  s///g 
returns the number of matches it finds.  m//g just returns 1 for matching.
Counts can be made using s//$&/g but this is wasteful, or by putting some 
counting loop round a m//g.  But this all seems rather messy. 

It's really much easier than all that:

$count = () = $string =~ /pattern/g;
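
Spelled out on an invented string, next to the counting loop it
replaces:

```perl
use strict;
use warnings;

my $string = "foo bar foo baz foo";

# List-context //g returns every match; the empty-list assignment counts them.
my $count = () = $string =~ /foo/g;
my $none  = () = $string =~ /quux/g;

# The loop form, for comparison.
my $looped = 0;
$looped++ while $string =~ /foo/g;

print "$count $none $looped\n";    # 3 0 3
```

Note that a pattern with more than one capture group returns all of its
captures per match in list context, so the count would then be the
number of captured substrings rather than the number of matches.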

--tom



Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()

2000-08-28 Thread Tom Christiansen

Simple solution.

If you want to require formats such as m/.../ (which I actually think is a
good idea), then make it part of -w, -W, -ww, or -WW, which would be a perl6
enhancement of strictness.

That's like having "use strict" enable mandatory perlstyle compliance
checks, and rejecting the program otherwise.  Doesn't seem sensible.

--tom



Re: RFC 158 (v1) Regular Expression Special Variables

2000-08-25 Thread Tom Christiansen

those early perl3 scripts by lwall floating around in /etc were poorly
written. i am glad they are finally out of the distribution.

Those weren't the scripts I was thinking about, and it is *NOT*
ipso facto true that something which uses $& or $` is poorly
written.

--tom



Re: RFC 145 (v2) Brace-matching for Perl Regular Expressions

2000-08-25 Thread Tom Christiansen

All in all, though, you're right that neither set of features is particularly
well-known/used outside of p5p followers. At least from what I've seen.
Virtually every person I've worked with since 5.6 came out has been surprised
and amazed at the REx eval stuff.

The completely reworked regex chapter in Camel III explains and demos all the
new 5.6 features.  I do not believe they will long remain the Cabal's secret.

--tom



Re: RFC 144 (v1) Behavior of empty regex should be simple

2000-08-24 Thread Tom Christiansen

I propose that this 'last successful match' behavior be discarded
entirely, and that an empty pattern always match the empty string.

I don't see a consideration for simply s/successful// above, which
has also been talked about.  That would also match expected usage
based upon existing editors.
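
For readers who have not met the behaviour under discussion, a sketch
of what Perl 5 currently does with an empty pattern, as documented in
perlop (the strings are invented):

```perl
use strict;
use warnings;

my $first  = "abc" =~ /b/    ? 1 : 0;   # succeeds; /b/ is now the last successful pattern
my $reused = "xyz" =~ //     ? 1 : 0;   # the empty pattern reuses /b/, so this fails
my $empty  = "xyz" =~ /(?:)/ ? 1 : 0;   # an explicitly empty group matches the empty string

print "$first $reused $empty\n";
```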

--tom



Re: RFC 150 (v1) Extend regex syntax to provide for return of a hash of matched subpatterns

2000-08-24 Thread Tom Christiansen

This is useful in that it would stop being number dependent.
For example, you can't now safely say

/$var (foo) \1/

and guarantee for arbitrary contents of $var that you still have
the right backreference number.

If I recall correctly, the Python folks addressed this.  One
might check that.
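
The numbering hazard is easy to reproduce.  In this sketch the contents
of $var and the test string are invented; the named-capture form in the
second half is a later addition to Perl (5.10 and up), not something
available when this thread was written:

```perl
use strict;
use warnings;

my $var  = '(a|b)';       # hypothetical interpolated fragment that adds a group
my $text = "afoo foo";

# \1 now refers to the group inside $var, not to (foo), so this fails.
my $numbered = $text =~ /${var}(foo) \1/ ? 1 : 0;

# A named group is immune to the renumbering.
my $named = $text =~ /${var}(?<word>foo) \k<word>/ ? 1 : 0;

print "numbered: $numbered, named: $named\n";
```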

--tom