RE: regex question

King, Jason G Mon, 02 Dec 2002 13:12:50 -0800

Dmitry writes..

>**************
>* Jason King *
>**************
>I don't think it's wise to assume anything about the input.
>In any case, if you want a more mundane example that also 
>breaks your regex:  
>< body>
>
>RESPONSE : Jason, the original task was to match new line 
>characters in the BODY tag. That's why a normal regex is a 
>perfectly viable solution here. If I was designing a search 
>engine spider I'd use a module. And for the example that 
>breaks my code: is it that hard to fix? Here's a version 1.1 
>of my code:


The originator said "the problem is I found some pages that have a
<body> tag that looks something like this" - see that bit "I found some
pages" - that suggests to me that they do NOT have control over the
input. So yes, the originator asked about matching newlines, but the
more forward-looking answer, the one that will save them trouble now
and in the future, is to use a module to parse what CANNOT be reliably
parsed with regex.
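
For anyone wondering what "use a module" might look like here, the following is a rough sketch only. It assumes the HTML-Parser distribution (which provides HTML::TokeParser) is installed; the helper name body_tags and the sample HTML are made up for illustration and are not from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # assumption: HTML-Parser distribution is installed

# Hypothetical helper: return one attribute hash ref per real <body>
# start tag found in the given HTML string.
sub body_tags {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Could not create parser";
    my @found;
    # get_tag('body') skips text and comments and returns the next
    # <body> start tag as [ $tag, $attr, $attrseq, $text ].
    while ( my $tag = $parser->get_tag('body') ) {
        push @found, $tag->[1];
    }
    return @found;
}

# Made-up sample: a commented-out body tag followed by a real one
# whose attributes are spread over several lines.
my $html = <<'HTML';
<html>
<!-- this used to be a <body> tag but I decided to use frames -->
<body
   bgcolor="white"
   text="black">
</body>
</html>
HTML

for my $attr ( body_tags($html) ) {
    print join( ' ', map { "$_=$attr->{$_}" } sort keys %$attr ), "\n";
}
```

The newlines inside the tag need no special handling, and the commented-out tag is not reported, because the tokenizer hands back comments as comments rather than as tags.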

>-------------------------------------
>#!/usr/bin/perl
>#bodytag.pl
>#regex that matches the body tag
>
>use warnings;
>use strict;
>
>$/ = ">";
>
>open FILE, "bodytag.htm" or die "Could not open the file bodytag.htm: $!";
>       while ( <FILE> ) {
>               if ( /<\s*(BODY)(.*)>/igs ) {
>                       s/\n/ /g; 
>                       s/ {2,}/ /g; 
>                       s/^ *(<)\s*(BODY.*) *$/$1$2/ig;
>                       print "$_\n";
>               }
>       }
>close FILE;
>-------------------------------------

So you really do want to play "break regex with HTML"? Here's another
mundane and common example that breaks your old and new code:

  <frameset>
    <!-- this used to be a <body> tag but I decided to use frames -->

Don't use regex to parse HTML unless you have a guaranteed (and that
usually means self-written) set of HTML.
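
A quick way to check the frameset example is to feed it to the quoted version 1.1 logic. This is only a sketch: the wrapper sub regex_body_matches is a made-up name, and it reads from an in-memory string instead of bodytag.htm so that it is self-contained. The record separator and the regexes are copied from the quoted script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the quoted version 1.1 logic. It collects
# what the script would have printed instead of printing it.
sub regex_body_matches {
    my ($html) = @_;
    local $/ = ">";    # same record separator as the quoted script
    open my $fh, '<', \$html or die "Could not open in-memory file: $!";
    my @hits;
    while (<$fh>) {
        if ( /<\s*(BODY)(.*)>/igs ) {
            s/\n/ /g;
            s/ {2,}/ /g;
            s/^ *(<)\s*(BODY.*) *$/$1$2/ig;
            push @hits, $_;
        }
    }
    close $fh;
    return @hits;
}

# The frameset example from above: no real body tag, only a comment.
my $html = <<'HTML';
  <frameset>
    <!-- this used to be a <body> tag but I decided to use frames -->
HTML

print "reported as a body tag: [$_]\n" for regex_body_matches($html);
```

The record that ends at the ">" of the commented-out tag satisfies the regex, so the script reports a body tag in a page that has none.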

>***************
>* Mark Mielke *
>***************
>
>It is highly doubtful that a single REGEXP that takes up less 
>than 80 characters of text is able to properly parse even limited HTML.
>
>RESPONSE : I didn't count how many characters my code uses 
>(should we use a perl module for that by the way, or is plain 
>perl enough ;-) ?), but I think I've just proven you wrong.

The less you think you're right, the less wrong you'll be. Again I
provide a simple, mundane example that your code fails to parse
properly. This was my whole initial point about modules: they've been
through a review process where heaps of eyes have looked at them and
debugged them - your code has not.

>******************
>* $Bill Luebkert *
>******************
>
>For general cases, I would probably agree.  For limited, 
>well-known or simple cases - I'd use a RE.  I think the 
>original example of this thread fell into the latter (unless I 
>mis-read it).  :)
>
>RESPONSE : Exactly what I was saying. I'm glad I'm not alone 
>on this one. :-)

That's not exactly what you've been saying; it's not even close.

What you said was "Don't listen to those who are telling you to use
modules, when you can use built-in regexes or other functions." You
might now reflect and wish you'd started with a softer and more
reasonable position, but you didn't - you came out against modules.

There was no caveat about requiring known input and no warning that
your code would fail in many mundane situations. You told someone not
to use modules (which solve all of these problems) and instead to use
your code (which does not). Had you included those caveats, no one
would have challenged your position. Instead your position was "don't
use modules, use regex instead", which is not very helpful and begs to
be challenged.

-- 
  Jason King
_______________________________________________
ActivePerl mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
