Edit report at https://bugs.php.net/bug.php?id=64946&edit=1

 ID:                 64946
 User updated by:    work at danemacmillan dot com
 Reported by:        work at danemacmillan dot com
 Summary:            DomDocument getAttribute return empty on IMG/src,
                     LINK/href, with protocol-rela
-Status:             Open
+Status:             Closed
 Type:               Bug
 Package:            *XML functions
 Operating System:   centos6.4 (64bit), win7 (64bit)
 PHP Version:        5.4.15
 Block user comment: N
 Private report:     N

 New Comment:

I don't know why, but the exact same code did not work for two days. After I 
reported the bug, it started working.


Previous Comments:
------------------------------------------------------------------------
[2013-05-29 22:29:31] work at danemacmillan dot com

Description:
------------
getAttribute, called on elements of a DOMDocument, returns an empty string for 
both an IMG tag's SRC attribute and a LINK tag's HREF attribute *when* the URLs 
provided are protocol-relative.

In the description below I'm going to refer to IMG tags only, but the same 
issue stands for the LINK tag's HREF attribute as well. There may be others, 
but these are the only two I discovered. 

The problem arises when scraping the SRC of IMG tags. If the IMG tag SRC has a 
protocol-relative URL (an absolute path beginning with "//" instead of 
"http://"), getAttribute returns an empty string for it. Relative paths are 
readable (e.g., "/img/landing.png").

I've tried both getElementsByTagName and XPath, and both show the same 
problem. However, as soon as I prefix a protocol-relative URL with an explicit 
protocol, the IMG SRC is readable.

This problem does not exist for the A tag HREF attribute, nor the SCRIPT tag 
SRC attribute; in each case the URL provided will be returned, regardless of 
the URL format.

Test script:
---------------
To summarize, the first two URLs will be readable, and the third URL will *not* 
be readable:

<img src="http://www.example.com/img/eg.png" />
<img src="/img/eg.png" />
<img src="//www.example.com/img/eg.png" />

I'll demonstrate two ways to grab the URLs, and they both fail with 
protocol-relative URLs.

// use for both examples:
$html = file_get_contents("http://www.example.com");

// one way (Xpath)
$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom); 
$data = array();
foreach($x->query('//img') as $node) {
    $data['img']['src'][] = urldecode($node->getAttribute('src'));
}

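// another way (getElementsByTagName)
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses libxml warnings about malformed markup
$imgs = $doc->getElementsByTagName('img');
$data = array();
for ($i = 0; $i < $imgs->length; $i++) {
    $src = $imgs->item($i)->getAttribute('src');
    // skip images whose src is missing or empty
    if ($src !== '') {
        // same array shape as the XPath example, so print_r output matches
        $data['img']['src'][] = urldecode($src);
    }
}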

// print results from either
print_r($data);
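
To separate the DOM extension from whatever the remote server actually 
returns, the same check can be run against an inline fixture. This is only a 
minimal sketch: the markup in $html is a made-up stand-in for the fetched 
page, and libxml_use_internal_errors() is used in place of the @ operator. 
Neither is part of the script above.

```php
<?php
// Hypothetical inline fixture standing in for the fetched page.
$html = <<<HTML
<html><head>
<link rel="stylesheet" href="//www.example.com/css/eg.css" />
</head><body>
<img src="http://www.example.com/img/eg.png" />
<img src="/img/eg.png" />
<img src="//www.example.com/img/eg.png" />
</body></html>
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Read the attributes without urldecode(), so the raw values are visible.
$data = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    $data['img']['src'][] = $img->getAttribute('src');
}
foreach ($dom->getElementsByTagName('link') as $link) {
    $data['link']['href'][] = $link->getAttribute('href');
}

print_r($data);
```

If this prints all three src values, including the protocol-relative one, the 
parser is probably not at fault and the HTML returned by file_get_contents() 
is worth inspecting instead.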

Expected result:
----------------
Array ( [img] => Array ( [src] => Array ( [0] => //www.example.com/img/eg.png ) 
) )

Actual result:
--------------
Array ( [img] => Array ( [src] => Array ( [0] =>  ) ) )


------------------------------------------------------------------------


