> -----Original Message-----
> From: John [mailto:[EMAIL PROTECTED]
> Sent: 12 July 2003 07:31
>
> I need to match a pattern, not in a single-line but from a HTML page,
> which obviously has loads of lines. I need to match 2 lines from this
> HTML page:
> 1) <HTML><TITLE>FirstVariable - Second Variable</TITLE></HTML>........
> 2) <TABLE><TD><TR>(newline)
> ThirdVariable</TR></TD></TABLE>...
>
> I tried this code:
> 1) preg_match("/<HTML><TITLE>(\S+) - (\S+)</TITLE></HTML>/", $html_page, $variables);
> 2) preg_match("/<TABLE><TD><TR>\n(\S+)</TR></TD></TABLE>/", $html_page, $variables);
>
> The first 2 variables are matched into the $variables array but not the
> third one. Sometimes when the third one is matched, it starts from where I
> want it to start but takes all the text to the end of the HTML document!
>
> Any ideas? Is there any characters that I should have escaped that I
> didn't?? All I can think of is that because the first line that I want to
> match is on the FIRST LINE of the html page, that matches. But reg-ex can't
> handle the next line as its way down the page????

Firstly, your newline may actually be any of \n, \r, or \r\n, depending on whether the file was built on a *nix, Mac or PC platform respectively, so your regex should take account of this.

Secondly, both of your examples should produce scads of errors when PHP attempts to parse the regular expression, because you have unescaped slashes in the pattern and your delimiters are also slashes -- so either this is not a direct cut-and-paste of what's actually in your script, or you're not letting on about something else! The fix for this one is either to escape the slashes that are actually part of the match, or to use something other than / as your delimiter.

Thirdly, as you've got double quotes around the regex, it would be advisable to double the backslashes themselves (to ensure PHP doesn't attempt to interpret any of its own backslash sequences) -- either that or use single quotes to enclose the regex.

Fourthly, and perhaps most importantly, by default the * and + quantifiers are "greedy" -- that is, they match as much as possible consistent with the whole match succeeding. With your first match, this doesn't matter as there's only one </TITLE> in the document, so there's no ambiguity; with the second match, there could be any number of occurrences of </TR></TD></TABLE> in the document, and the greedy matching of \S+ means that it will run on to the *last* of these that it can reach without crossing any whitespace. The way to counter this is to make the match "ungreedy", either by writing \S+? or by adding the U modifier to your regex.

Taking all of these into account, you probably want something like:

    preg_match('!<TABLE><TD><TR>(\n|\r\n|\r)(\S+)</TR></TD></TABLE>!U', $html_page, $variables);

Note that the newline alternation is itself a capturing group, so the value you're after ends up in $variables[2] rather than $variables[1].

Cheers!

Mike

---------------------------------------------------------------------
Mike Ford, Electronic Information Services Adviser,
Learning Support Services, Learning & Information Services,
JG125, James Graham Building, Leeds Metropolitan University,
Beckett Park, LEEDS, LS6 3QS, United Kingdom
Email: [EMAIL PROTECTED]
Tel: +44 113 283 2600 extn 4730 Fax: +44 113 283 3211

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
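
P.S. In case it helps to see the greedy/ungreedy difference in isolation, here is a minimal sketch. The sample $html_page is made up for illustration (it is not your real page), and I've assumed a non-capturing group (?:...) for the line break so that the value lands in index 1; the variable names $newline, $greedy, $lazy, $g and $l are likewise just placeholders.

```php
<?php
// Hypothetical sample page (not from the original post): the wanted value
// follows a Windows-style line break, and a second table comes straight
// after the first with no whitespace in between.
$html_page = "<HTML><TITLE>First - Second</TITLE></HTML>\r\n"
           . "<TABLE><TD><TR>\r\nThird</TR></TD></TABLE>"
           . "<TABLE><TD><TR>Fourth</TR></TD></TABLE>\r\n";

// Single quotes keep PHP from touching the backslashes; using "!" as the
// delimiter means the slashes in the closing tags need no escaping.
$newline = '(?:\r\n|\n|\r)';   // non-capturing, so the value stays in [1]
$greedy  = '!<TABLE><TD><TR>' . $newline . '(\S+)</TR></TD></TABLE>!';
$lazy    = $greedy . 'U';      // U inverts the greediness of the quantifiers

preg_match($greedy, $html_page, $g);
preg_match($lazy, $html_page, $l);

echo $g[1], "\n";   // greedy capture
echo $l[1], "\n";   // ungreedy capture
```

On this sample the greedy pattern captures everything from Third up to and including Fourth (closing and opening tags in between included), while the U version stops at the first </TR></TD></TABLE> and captures just Third.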