Re: [PHP] Need help with RegEx

Michael Mon, 11 Dec 2006 04:13:06 -0800

I just realized I neglected to explain a couple of things here, sorry...

My method will only work for the FIRST occurrence of the div tag pair in 
$source_html.
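
If every occurrence is needed rather than just the first, one possible 
alternative (not the preg_replace method described here) is preg_match_all. 
A minimal sketch, assuming a made-up $source_html that contains two of the 
result_box divs; the variable names $pattern, $count and $all_texts are only 
for illustration:

```php
<?php
// Hypothetical input with two matching div pairs.
$source_html = "<div id=result_box dir=ltr>first</div>\n<p>x</p>\n"
             . "<div id=result_box dir=ltr>second</div>";

// Group 1 captures the text between the tags; "s" lets "." span newlines.
$pattern = '/<div id=result_box dir=ltr>(.*?)<\/div>/s';

// With the default flags, $matches[1] holds the group-1 text of every match.
$count = preg_match_all($pattern, $source_html, $matches);

$all_texts = $matches[1];
print_r($all_texts);
```

Here $count is the number of div pairs found and $all_texts holds the text 
of each one in document order.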


The reason my preg_replace method works is that you are telling preg_replace to 
replace everything that matches the match pattern with just what is contained 
in the third atom of the match pattern. Since we are matching everything 
between the start of $source_html and the end of $source_html (the (.*?) atom 
at the beginning, and the (.*?)$ atom at the end, where $ anchors the end of 
the string), your return value ends up being $3, or the contents of the third 
atom of the match pattern, which represents the text between the opening tag 
and closing tag of your div element.
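
As a rough, self-contained sketch of that call with $ as the end-of-string 
anchor (the sample $source_html is made up; in practice it would be the page 
fetched with cURL, and $pattern is just a name I picked):

```php
<?php
// Hypothetical sample page standing in for the cURL response.
$source_html = "<html><body>\n"
             . "<div id=result_box dir=ltr>THIS IS A TEST</div>\n"
             . "<p>footer</p></body></html>";

// Single quotes so PHP leaves "$" and "\/" alone.
// "$" anchors the end of the subject; "s" lets "." match newlines.
$pattern = '/(.*?)(<div id=result_box dir=ltr>)(.*?)(<\/div>)(.*?)$/s';

// The whole subject matches, and is replaced by the third group only.
$trans_text = preg_replace($pattern, '$3', $source_html);

echo $trans_text, "\n";
```

Note that without the D modifier, $ also matches just before a final newline, 
so if $source_html ends in "\n" that newline is left over in the result; 
using \z in place of $ avoids that.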

hope this makes sense, I'm writing this at 5am heh

Cheers,
Michael

At 04:58 AM 12/11/2006 , Michael wrote:
>At 01:02 AM 12/11/2006 , Anthony Papillion wrote:
>>Hello Everyone,
>>
>>I am having a bit of problems wrapping my head around regular expressions. I 
>>thought I had a good grip on them but, for some reason, the expression I've 
>>created below simply doesn't work! Basically, I need to retrieve all of the 
>>text between two unique and specific tags but I don't need the tag text. So 
>>let's say that the tag is
>>
>><tag lang='ttt'>THIS IS A TEST</tag>
>>
>>I would need to retrieve THIS IS A TEST only and nothing else.
>>
>>Now, a bit more information: I am using cURL to retrieve the entire contents 
>>of a webpage into a variable. I am then trying to perform the following 
>>regular expression on the retrieved text:
>>
>>$trans_text = preg_match("\/<div id=result_box dir=ltr>(.+?)<\/div>/");
>
>Using the tags you describe here, and assuming the source html is in the
>variable $source_html, try this:
>
>$trans_text = preg_replace(
>    "/(.*?)(<div id=result_box dir=ltr>)(.*?)(<\/div>)(.*?)$/s",
>    "$3",
>    $source_html);
>
>how this breaks down is:
> 
>opening quote for first parameter (your MATCH pattern).
>
>open regex match pattern= /
>
>first atom (.*?) = any or no leading text before <div id=result_box dir=ltr>,
>the ? makes it non-greedy so that it stops at the first occurrence of that tag.
>
>second atom (<div id=result_box dir=ltr>) = the opening tag you are looking 
>for.
>
>third atom (.*?) = the text you want to pull out: all text, even if nothing is
>there, between the 2nd and 4th atoms.
>
>fourth atom (<\/div>) = the closing tag of the div tag pair.
>
>fifth atom (.*?) = all of the rest of the source html after the closing tag up
>to the end of the string $, even if there is nothing there.
>
>end of string $ (this anchor matches the end of the string you are
>matching/replacing, $source_html)
>
>close regex match pattern= /s
>
>in order for this to work on html that may contain newlines, you must specify
>that the . can represent newline characters, this is done by adding the letter
>'s' after your regex closing /, so the last thing in your regex match pattern
>would be /s.
>
>closing quote for first parameter.
>
>The second parameter of the preg_replace is the replacement: a back-reference
>to the atom # which contains the text you want to keep in place of everything
>matched by the regex match pattern in the first parameter. In this case the
>text we want is in the third atom, so this parameter would be $3 (this is the
>PHP way of back-referencing: if we wanted the text before the tag we would use
>atom 1, or $1; if we wanted the tag itself we would use $2, etc. Basically a $
>followed by the atom # that holds what we want to end up in $trans_text).
>
>The third parameter of the preg_replace is the source you wish to match and
>replace from, in this case your source html in $source_html.
>
>after this executes, $trans_text should contain the innerText of the <div
>id=result_box dir=ltr></div> tag pair from $source_html, if there is nothing
>between the opening and closing tags, $trans_text will == "", if there is only
>a newline between the tags, $trans_text will == "\n". IMPORTANT: if the text
>between the tags contains a newline, $trans_text will also contain that newline
>character because we told . to match newlines.
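
The edge cases described above can be checked with a short sketch. The helper 
name extract_result_box and the sample inputs are made up for illustration, 
and the pattern assumes $ as the end-of-string anchor:

```php
<?php
// Hypothetical helper wrapping the preg_replace call described above.
function extract_result_box(string $source_html): string
{
    // "$" anchors the end of the subject; "s" lets "." match newlines.
    $pattern = '/(.*?)(<div id=result_box dir=ltr>)(.*?)(<\/div>)(.*?)$/s';
    return preg_replace($pattern, '$3', $source_html);
}

// Nothing between the tags.
$empty = extract_result_box('<p>a</p><div id=result_box dir=ltr></div><p>b</p>');

// Only a newline between the tags.
$newline = extract_result_box("<p>a</p><div id=result_box dir=ltr>\n</div><p>b</p>");

// Text that itself contains a newline.
$multi = extract_result_box("<div id=result_box dir=ltr>one\ntwo</div>");

var_dump($empty === '', $newline === "\n", $multi === "one\ntwo");
```
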
>
>I am no regex expert by far, but this worked for me (assuming I copied it
>correctly here heh)
>There are doubtless many other ways to do this, and I am sure others on the
>list here will correct me if my way is wrong or inefficient.
>
>I hope this works for you and that I haven't horribly embarrassed myself here.
>Good luck :)
>
>>
>>The problem is that when I echo the value of $trans_text variable, I end up 
>>with the entire HTML of the page.
>>
>>Can anyone clue me in to what I am doing wrong?
>>
>>Thanks,
>>Anthony 
>>
>>-- 
>>PHP General Mailing List (http://www.php.net/)
>>To unsubscribe, visit: http://www.php.net/unsub.php
>>  

