On 18 February 2011 22:36, Tommy Pham <tommy...@gmail.com> wrote:
> Hi folks,
>
> This is not directly related to PHP but it's Friday so I'm gonna give
> it a shot :).  Would someone please help me figure out why my regex
> pattern doesn't work?  Below is the code and sample data:
>
> $html = <<<HTML
> <li class="small  tab "><a class="y-mast-link images"
> href="http://images.search.yahoo.com/images"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Images</span></a></li>
> <li class="small  tab "><a class="y-mast-link video"
> href="http://video.search.yahoo.com/video"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Video</span></a></li>
> <li class="small  tab "><a class="y-mast-link local"
> href="http://local.yahoo.com/results"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span></a></li>
> <li class="small  tab "><a class="y-mast-link shopping"
> href="http://shopping.yahoo.com/search"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
> <li class="small lasttab more-tab "><a class="y-mast-link more"
> href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
> class="tab-cover y-mast-bg-hide">More</span><span
> class="y-fp-pg-controls arrow"></span></a></li>
> HTML;
>
> $pattern = 
> '%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
> preg_match_all($pattern, $html, $matches);
>
> The only match I got is:
>
> Match 1 of 1:   <a class="y-mast-link local"
> href="http://local.yahoo.com/results"
> data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span></a>
>
> Group 1:        http://local.yahoo.com/results
>
> Group 2:        <span class="tab-cover y-mast-bg-hide"
> style="padding-left:0em;padding-right:0em;">Local</span>
>
> The pattern is meant to work even when the page does not comply
> with any W3C standard.
>

I'm not entirely sure what your input data looks like, since one or more
mail programs may have added line breaks along the way. When I run the
code I get no matches at all, so your input probably differs from what
arrived here. My guess is that you also have line breaks on your end,
just not evenly distributed, which would explain the single hit.
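
Another possible explanation for the single hit, assuming each <li> sits
on one line in your original input: [^href] is a character class, so
[^href]* matches any run of characters other than h, r, e and f, not
"anything except the word href". Of the five class values in the sample,
only "y-mast-link local" is free of those four letters. A minimal sketch
to check this (the $classValues and $survivors names are just for
illustration):

```php
<?php
// Class attribute values taken from the sample HTML in the original post.
$classValues = [
    'y-mast-link images',
    'y-mast-link video',
    'y-mast-link local',
    'y-mast-link shopping',
    'y-mast-link more',
];

// [^href] is a negated character class: it matches a single character that
// is not "h", "r", "e" or "f". It does not exclude the word "href".
$survivors = [];
foreach ($classValues as $value) {
    // The value can only be skipped by [^href]* if none of its
    // characters is one of those four letters.
    if (preg_match('/^[^href]*$/', $value) === 1) {
        $survivors[] = $value;
    }
}

// Only 'y-mast-link local' passes the check.
print_r($survivors);
```
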
Apart from that, there are a couple of things I'd rework in your regex:

%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\'"]+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims

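* added a + quantifier to the first \s, so any run of whitespace after
  <a is accepted
* replaced [^href]* with a non-greedy .*? and a (?!href) lookahead to
  skip over whatever precedes the href attribute
* match the href attribute name, with optional whitespace around the =
* use proper alternation for the value (unquoted, single-quoted or
  double-quoted); inside [...] a | is a literal pipe, not an alternation
* capture anything inside the <a> tag, non-greedy
* match with a closing </a> tag
* added the s modifier so that . also matches line breaks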
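
To try the reworked pattern on a cut-down copy of the sample data,
something like the sketch below should do. Note that group 1 keeps the
surrounding quotes (as the results further down show), so the sketch
trims them. The $links array and its 'href'/'text' keys are names made
up for this example, and strip_tags() is only used to reduce group 2 to
its visible text.

```php
<?php
// Two anchors from the sample data, including the line breaks that a
// mail client may have introduced.
$html = <<<HTML
<li class="small  tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;

// The reworked pattern from this mail, unchanged.
$pattern = '%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\'"]+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $match) {
    $links[] = [
        // Group 1 still carries the quotes around the URL, so trim them.
        'href' => trim($match[1], '\'"'),
        // Group 2 is the inner HTML; keep only the visible text.
        'text' => trim(strip_tags($match[2])),
    ];
}

print_r($links);
```
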

Results:
array(3) {
  [0]=>
  array(5) {
    [0]=>
    string(205) "<a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a>"
    [1]=>
    string(201) "<a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a>"
    [2]=>
    string(196) "<a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a>"
    [3]=>
    string(204) "<a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a>"
    [4]=>
    string(188) "<a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a>"
  }
  [1]=>
  array(5) {
    [0]=>
    string(39) ""http://images.search.yahoo.com/images""
    [1]=>
    string(37) ""http://video.search.yahoo.com/video""
    [2]=>
    string(32) ""http://local.yahoo.com/results""
    [3]=>
    string(34) ""http://shopping.yahoo.com/search""
    [4]=>
    string(55) ""http://tools.search.yahoo.com/about/forsearchers.html""
  }
  [2]=>
  array(5) {
    [0]=>
    string(96) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span>"
    [1]=>
    string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span>"
    [2]=>
    string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span>"
    [3]=>
    string(98) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span>"
    [4]=>
    string(94) "<span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span>"
  }
}


-- 
<hype>
WWW: plphp.dk / plind.dk
LinkedIn: plind
BeWelcome/Couchsurfing: Fake51
Twitter: kafe15
</hype>

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
