Tim Allison created NUTCH-3000:
----------------------------------

             Summary: protocol-selenium returns only the body, strips off the <head/> element
                 Key: NUTCH-3000
                 URL: https://issues.apache.org/jira/browse/NUTCH-3000
             Project: Nutch
          Issue Type: Bug
          Components: protocol
            Reporter: Tim Allison


The selenium protocol returns only the body portion of the HTML, which means 
that neither the title nor the other page metadata in the <head/> section gets 
extracted.

{noformat}
String innerHtml = driver.findElement(By.tagName("body"))
                        .getAttribute("innerHTML");
{noformat}

We should return the full HTML, no?
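
To illustrate the effect, here is a minimal, stdlib-only sketch. The sample page, the HeadDemo class and the regex-based extractTitle helper are all hypothetical and are not Nutch code. It only shows that a title lookup has nothing to find once the markup is reduced to the body's innerHTML. On the Selenium side, driver.getPageSource() or reading outerHTML from the <html> element might be ways to get the whole document, but that is an assumption about a possible fix, not a tested change.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadDemo {
    // Hypothetical sample page, used only for illustration.
    static final String FULL_HTML =
        "<html><head><title>Example page</title>"
        + "<meta name=\"description\" content=\"demo\"></head>"
        + "<body><p>Hello</p></body></html>";

    // What an innerHTML read of <body> would yield for the page above.
    static final String BODY_ONLY = "<p>Hello</p>";

    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Naive title lookup for the demo; a real HTML parser would be used in practice.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Possible Selenium-side sources of the full markup (assumption, untested here):
        //   String fullHtml = driver.getPageSource();
        //   String fullHtml = driver.findElement(By.tagName("html")).getAttribute("outerHTML");
        System.out.println(extractTitle(FULL_HTML)); // prints Optional[Example page]
        System.out.println(extractTitle(BODY_ONLY)); // prints Optional.empty
    }
}
```

The same loss applies to <meta/> tags and anything else that lives in <head/>, since none of it is part of the body's innerHTML.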



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
