On 18 February 2011 22:36, Tommy Pham wrote:
> Hi folks,
>
> This is not directly related to PHP but it's Friday so I'm gonna give
> it a shot :). Would someone please help me figure out why my regex
> pattern doesn't work? Below is the code and sample data:
>
> $html = <<<HTML
> <li class="small tab "><a class="y-mast-link images"
> href="http://images.search.yahoo.com/images"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Images</span></a></li>
> <li class="small tab "><a class="y-mast-link video"
> href="http://video.search.yahoo.com/video"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Video</span></a></li>
> <li class="small tab "><a class="y-mast-link local"
> href="http://local.yahoo.com/results"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span></a></li>
> <li class="small tab "><a class="y-mast-link shopping"
> href="http://shopping.yahoo.com/search"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
> <li class="small lasttab more-tab "><a class="y-mast-link more"
> href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
> class="tab-cover y-mast-bg-hide">More</span><span
> class="y-fp-pg-controls arrow"></span></a></li>
> HTML;
>
> $pattern =
> '%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
> preg_match_all($pattern, $html, $matches);
>
> The only match I got is:
>
> Match 1 of 1: <a class="y-mast-link local"
> href="http://local.yahoo.com/results"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span></a>
>
> Group 1: http://local.yahoo.com/results
>
> Group 2: <span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span>
>
> I wrote the pattern to work even in cases where the page does not
> comply with any W3C standard.
>
Not entirely sure what your input data is, as I'm guessing one or more
mail programs may have added line breaks. When I run the code I get no
matches at all, so your input is probably different from what arrived
here. More specifically, I'm guessing you have line breaks on your end
too, but not equally distributed, which would explain the one hit.
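
Another possible contributor to the single hit: [^href] is a negated
character class (any character other than h, r, e or f), not "anything
but the word href". Of the five class values only "y-mast-link local"
contains none of those letters. A minimal sketch of that effect follows.
The one-line sample tags are my own simplification, not your exact data:

```php
<?php
// Hypothetical one-line sample tags (simplified, not the exact list data).
$tags = [
    'local'  => '<a class="y-mast-link local" href="http://local.yahoo.com/results">Local</a>',
    'images' => '<a class="y-mast-link images" href="http://images.search.yahoo.com/images">Images</a>',
];

// [^href]* matches a run of characters that are none of h, r, e, f.
// It stops at the "e" in "images", so "href" can never follow there.
$prefix = '%<a\s[^href]*href%i';

$hits = [];
foreach ($tags as $name => $tag) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    $hits[$name] = preg_match($prefix, $tag);
}

print_r($hits);
```

With these two tags, only the "local" entry should report a match.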
Apart from that, there are a couple of things I'd rework in your regex:
%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\'"]+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims
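* added a + quantifier to the first whitespace (\s+)
* allow any characters before href, non-greedy (the (?!href) lookahead
  adds little here, and the pattern as written expects at least one
  attribute before href)
* match the href attribute
* use proper alternation for the value: unquoted, single-quoted or
  double-quoted
* capture anything inside the <a> tag, non-greedy
* match the closing </a> tag
* added the s modifier so that . also matches line breaks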
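
To try the reworked pattern, something along these lines should do. It
is only a sketch: the input is cut down to two of your list items, and
the extra <a href="x">y</a> string is a made-up test case showing that
the pattern, as written, does not match when href is the first attribute
after a single space:

```php
<?php
// Two of the original list items, with the line breaks as they arrived by mail.
$html = <<<HTML
<li class="small tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;

// The reworked pattern from this message, unchanged.
$pattern = '%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\'"]+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims';

// $matches[1] holds the href values (quotes included), $matches[2] the tag bodies.
$count = preg_match_all($pattern, $html, $matches);

// Made-up case: href directly after "<a " is not matched by this pattern.
$hrefFirst = preg_match($pattern, '<a href="x">y</a>');

var_dump($count, $matches[1], $hrefFirst);
```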
Results:
array(3) {
[0]=>
array(5) {
[0]=>
string(205) "<a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a>"
[1]=>
string(201) "<a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a>"
[2]=>
string(196) "<a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a>"
[3]=>
string(204) "<a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a>"
[4]=>
string(188) "<a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a>"
}
[1]=>
array(5) {
[0]=>
string(39) ""http://images.search.yahoo.com/images""
[1]=>
string(37) ""http://video.search.yahoo.com/video""
[2]=>
string(32) ""http://local.yahoo.com/results""
[3]=>
string(34) ""http://shopping.yahoo.com/search""
[4]=>
string(55) ""http://tools.search.yahoo.com/about/forsearchers.html""
}
[2]=>
array(5) {
[0]=>
string(96) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span>"
[1]=>
string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span>"
[2]=>
string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span>"
[3]=>
string(98) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span>"
[4]=>
string(94) "<span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span>"
}
}
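
Note that group 1 in the dump keeps the surrounding quotes, because the
alternation captures them along with the value. If you only want the
bare URLs, one option is to trim the quotes afterwards. A small sketch,
with $captured standing in for $matches[1] (the variable name is mine):

```php
<?php
// Stand-in for $matches[1]: two of the captured values shown in the dump.
$captured = [
    '"http://images.search.yahoo.com/images"',
    '"http://video.search.yahoo.com/video"',
];

// Strip any leading or trailing single or double quotes from each value.
$urls = array_map(function ($value) {
    return trim($value, '\'"');
}, $captured);

print_r($urls);
```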
--
<hype>
WWW: plphp.dk / plind.dk
LinkedIn: plind
BeWelcome/Couchsurfing: Fake51
Twitter: kafe15
</hype>