RE: [PHP] REGULAR EXPRESSION HELP

Ford, Mike [LSS] Mon, 14 Jul 2003 05:33:38 -0700

> -----Original Message-----
> From: John [mailto:[EMAIL PROTECTED]
> Sent: 12 July 2003 07:31
> 
> I need to match a pattern, not in a single-line but from a HTML page,
> which obviously has loads of lines. I need to match 2 lines from this
> HTML page:
> 1) <HTML><TITLE>FirstVariable - Second Variable</TITLE></HTML>........
> 2) <TABLE><TD><TR>(newline)
> ThirdVariable</TR></TD></TABLE>...
> 
> I tried this code:
> 1) preg_match("/<HTML><TITLE>(\S+) - (\S+)</TITLE></HTML>/", $html_page, $variables);
> 2) preg_match("/<TABLE><TD><TR>\n(\S+)</TR></TD></TABLE>/", $html_page, $variables);
> 
> The first 2 variables are matched into the $variables array but not the
> third one. Sometimes when the third one is matched, it starts from where I
> want it to start but takes all the text to the end of the HTML document!
> 
> Any ideas? Is there any characters that I should have escaped that I
> didn't?? All I can think of is that because the first line that I want to
> match is on the FIRST LINE of the html page, that matches. But reg-ex can't
> handle the next line as its way down the page????


Firstly, your newline may actually be any of \n, \r, or \r\n, depending on
whether the file was built on a *nix, Mac or PC platform, so your regex
should take account of this.
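
As a minimal sketch of that point (the sample strings and the cut-down
<TR> pattern are made up for illustration, not taken from the real page),
one alternation can accept all three line endings:

```php
<?php
// Hypothetical sample inputs: the same fragment with each newline style.
$samples = array(
    'unix' => "<TR>\nThird</TR>",
    'mac'  => "<TR>\rThird</TR>",
    'dos'  => "<TR>\r\nThird</TR>",
);

// (?:\r\n|\r|\n) accepts any one of the three line endings without
// creating a capture group. \r\n is listed first so that a DOS ending
// is consumed as a single unit.
$pattern = '!<TR>(?:\r\n|\r|\n)(\S+)</TR>!';

$found = array();
foreach ($samples as $name => $text) {
    if (preg_match($pattern, $text, $m)) {
        $found[$name] = $m[1];
    }
}
print_r($found);
```

Because the group is non-capturing, the wanted text stays in $m[1]
whichever line ending the file uses.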

Secondly, both of your examples should produce scads of errors attempting to
parse the regular expression, because you have unescaped slashes and your
delimiters are also slashes -- so either this is not a direct cut-and-paste
of what's actually in your script, or you're not letting on about something
else!  The fix for this one is either to escape the slashes that are
actually part of the match, or use something other than / as your delimiter.
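
To illustrate the two fixes side by side, here is a small sketch; the
$html_page content is a made-up sample and the pattern is shortened to the
<TITLE> part only:

```php
<?php
// Hypothetical sample page, for illustration only.
$html_page = '<HTML><TITLE>First - Second</TITLE></HTML>';

// Option 1: keep / as the delimiter and escape each literal slash.
$escaped = '/<TITLE>(\S+) - (\S+)<\/TITLE>/';

// Option 2: pick a delimiter that does not occur in the pattern.
$alternate = '!<TITLE>(\S+) - (\S+)</TITLE>!';

preg_match($escaped, $html_page, $a);
preg_match($alternate, $html_page, $b);

// Both forms describe the same regex, so the results are identical.
var_dump($a === $b);
```

With several slashes in the pattern, the second form is usually the easier
one to read.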

Thirdly, as you've got double quotes around the regex, it would be advisable
to double the backslashes themselves (to ensure PHP doesn't attempt to
interpret any of its own backslash sequences) -- either that or use single
quotes to enclose the regex.
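
A quick way to see what PHP does to the string before the regex engine
gets it (the variable names here are just for illustration):

```php
<?php
// Inside double quotes PHP itself converts \n to a newline character
// before the regex engine ever sees the string.
$double  = "\n";    // 1 character: an actual newline
$doubled = "\\n";   // 2 characters: backslash, n
$single  = '\n';    // 2 characters: backslash, n

var_dump(strlen($double), strlen($doubled), strlen($single));

// Doubling the backslash or switching to single quotes gives the same string.
var_dump($doubled === $single);
```

PHP leaves sequences it does not recognise (such as \S) alone inside double
quotes, which is why the original patterns mostly survive, but relying on
that is fragile.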

Fourthly, and perhaps most importantly, by default the * and + quantifiers
are "greedy" -- that is, they match as much as possible consistent with the
whole match succeeding.  With your first match, this doesn't matter as
there's only one </TITLE> in the document, so there's no ambiguity; with the
second match, there could be any number of occurrences of </TR></TD></TABLE>
in the document, and the greedy matching of \S+ means that it will run on to
the *last* of these that it can reach without crossing any whitespace.  The
way to counter this is to make the match "ungreedy", either with the U
modifier on the whole regex or with a ? after the individual quantifier.
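
The difference is easiest to see on a small input. This sketch assumes a
made-up one-line fragment with two tables and no whitespace between them:

```php
<?php
// Hypothetical input: two tables back to back, no whitespace anywhere.
$html = '<TABLE><TD><TR>one</TR></TD></TABLE>'
      . '<TABLE><TD><TR>two</TR></TD></TABLE>';

// Greedy (the default): \S+ grabs as much as it can, so the match
// runs on to the last closing sequence.
preg_match('!<TR>(\S+)</TR></TD></TABLE>!', $html, $greedy);

// Ungreedy via the U modifier, which flips every quantifier in the pattern.
preg_match('!<TR>(\S+)</TR></TD></TABLE>!U', $html, $lazyU);

// Ungreedy via ? after the one quantifier that needs it.
preg_match('!<TR>(\S+?)</TR></TD></TABLE>!', $html, $lazyQ);

echo $greedy[1], "\n";  // one</TR></TD></TABLE><TABLE><TD><TR>two
echo $lazyU[1], "\n";   // one
echo $lazyQ[1], "\n";   // one
```

The +? form only affects the quantifier it is attached to, which can be the
safer choice when a pattern has several quantifiers.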

Taking all of these into account, you probably want something like:

preg_match('!<TABLE><TD><TR>(\n|\r\n|\r)(\S+)</TR></TD></TABLE>!U',
$html_page, $variables);
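
For example, run against a made-up fragment with a PC-style line ending
(the $html_page content below is an assumption for illustration), the
pattern above behaves like this:

```php
<?php
// Hypothetical page fragment using a \r\n line ending.
$html_page = "<TABLE><TD><TR>\r\nThirdVariable</TR></TD></TABLE>";

if (preg_match('!<TABLE><TD><TR>(\n|\r\n|\r)(\S+)</TR></TD></TABLE>!U',
               $html_page, $variables)) {
    // $variables[0] is the whole match, [1] is the line ending,
    // and [2] is the text that was wanted.
    echo $variables[2], "\n";
}
```

Note that the newline alternation is itself a capturing group, so the third
variable lands in $variables[2] rather than $variables[1].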

Cheers!

Mike

---------------------------------------------------------------------
Mike Ford,  Electronic Information Services Adviser,
Learning Support Services, Learning & Information Services,
JG125, James Graham Building, Leeds Metropolitan University,
Beckett Park, LEEDS,  LS6 3QS,  United Kingdom
Email: [EMAIL PROTECTED]
Tel: +44 113 283 2600 extn 4730      Fax:  +44 113 283 3211

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
