From:             work at danemacmillan dot com
Operating system: centos6.4 (64bit), win7 (64bit)
PHP version:      5.4.15
Package:          *XML functions
Bug Type:         Bug
Bug description:  DomDocument getAttribute return empty on IMG/src, LINK/href, with protocol-rela

Description:
------------
DOMDocument's getAttribute() returns an empty string for both an IMG tag's SRC
attribute and a LINK tag's HREF attribute *when* the URL provided is
protocol-relative. In the description below I refer to IMG tags only, but the
same issue applies to the LINK tag's HREF attribute. There may be others, but
these are the only two I discovered.

The problem arises when scraping the SRC of IMG tags. If the IMG tag's SRC is a
protocol-relative URL (a URL that begins with "//" instead of "http://"), it is
unreadable. Relative paths are readable (e.g., "/img/landing.png"). I've used
both the getElementsByTagName method and the Xpath method; both suffer from the
same problem. However, the moment I prepend the protocol to the URL, the IMG
SRC is readable.

This problem does not exist for the A tag's HREF attribute, nor the SCRIPT
tag's SRC attribute; in each case the URL provided is returned, regardless of
the URL format.

Test script:
---------------
To summarize, the first two URLs will be readable, and the third URL will *not*
be readable:

<img src="http://www.example.com/img/eg.png" />
<img src="/img/eg.png" />
<img src="//www.example.com/img/eg.png" />

I'll demonstrate two ways to grab the URLs, and they both fail with
protocol-relative URLs.

// use for both examples:
$html = file_get_contents("http://www.example.com");

// one way (Xpath)
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides parser warnings for malformed markup
$x = new DOMXPath($dom);
$data = array();
foreach ($x->query('//img') as $node) {
    $data['img']['src'][] = urldecode($node->getAttribute('src'));
}

// another way (getElementsByTagName)
$doc = new DOMDocument();
@$doc->loadHTML($html);
$imgs = $doc->getElementsByTagName('img');
$data = array();
for ($i = 0; $i < $imgs->length; $i++) {
    $img = $imgs->item($i);
    if ($img->getAttribute('src')) {
        $data[] = urldecode($img->getAttribute('src'));
    }
}

// print results from either
print_r($data);

Expected result:
----------------
Array
(
    [img] => Array
        (
            [src] => Array
                (
                    [0] => //www.example.com/img/eg.png
                )

        )

)

Actual result:
--------------
Array
(
    [img] => Array
        (
            [src] => Array
                (
                    [0] =>
                )

        )

)

--
Edit bug report at https://bugs.php.net/bug.php?id=64946&edit=1
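
A self-contained variant of the test script may make this easier to check, since it does not depend on fetching a live page. The sketch below is an assumption-laden illustration, not the original script: the inline $html fixture, the CSS URL in the LINK tag, and the collectAttribute() helper are all hypothetical names introduced here. It loads the three IMG tags from the description plus one LINK tag, then dumps whatever getAttribute() returns, without asserting which outcome a given PHP/libxml2 build produces.

```php
<?php
// Hypothetical inline fixture standing in for the fetched page.
// It contains the three IMG variants from the description and one LINK tag.
$html = <<<HTML
<!DOCTYPE html>
<html><head>
<link rel="stylesheet" href="//www.example.com/css/eg.css" />
</head><body>
<img src="http://www.example.com/img/eg.png" />
<img src="/img/eg.png" />
<img src="//www.example.com/img/eg.png" />
</body></html>
HTML;

/**
 * Collect one attribute from every element with the given tag name.
 * collectAttribute() is a hypothetical helper, not part of the DOM extension.
 * A missing attribute is recorded as null so it can be told apart from
 * an attribute that is present but empty.
 */
function collectAttribute($html, $tag, $attr)
{
    $doc = new DOMDocument();

    // Buffer libxml warnings instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = array();
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $values[] = $node->hasAttribute($attr) ? $node->getAttribute($attr) : null;
    }
    return $values;
}

// Dump the raw values; urldecode() is left out so the output is exactly
// what the DOM extension returned.
var_dump(collectAttribute($html, 'img', 'src'));
var_dump(collectAttribute($html, 'link', 'href'));
```

If the third IMG entry and the LINK entry come back as empty strings, the behaviour described in this report is reproduced. If they come back as the "//www.example.com/..." URLs, the markup served by the scraped page is the more likely culprit.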