Tim Allison created NUTCH-3000:
----------------------------------

    Summary: protocol-selenium returns only the body, strips off the <head/> element
    Key: NUTCH-3000
    URL: https://issues.apache.org/jira/browse/NUTCH-3000
    Project: Nutch
    Issue Type: Bug
    Components: protocol
    Reporter: Tim Allison

The selenium protocol returns only the body portion of the HTML, which means that neither the title nor the other page metadata in the <head/> section gets extracted.

{noformat}
String innerHtml = driver.findElement(By.tagName("body"))
        .getAttribute("innerHTML");
{noformat}

We should return the full HTML, no?

--
This message was sent by Atlassian Jira (v8.20.10#820010)
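
To illustrate the effect described in the report, here is a minimal, self-contained sketch that uses only the Java standard library. It is not Nutch or Selenium code: the class HeadLossDemo, its methods, and the sample page are hypothetical, and the regex-based lookup only stands in for a real HTML parser. It compares title extraction on a full document with extraction on the body's inner HTML alone.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch or Selenium.
class HeadLossDemo {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Naive title lookup, standing in for a real HTML parser.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    // Mimics reading the body's innerHTML: keeps only what is inside <body>.
    static String bodyInnerHtml(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Assumed sample page, for illustration only.
        String page = "<html><head><title>Example page</title>"
                + "<meta name=\"description\" content=\"demo\"></head>"
                + "<body><h1>Hello</h1></body></html>";

        // Full document: the title can still be found.
        System.out.println(extractTitle(page));
        // Body only: everything from <head/>, including the title, is gone.
        System.out.println(extractTitle(bodyInnerHtml(page)));
    }
}
```

On the Selenium side, WebDriver.getPageSource() is an existing call that returns the source of the current page, so it may be one candidate for returning the full HTML. What it returns can differ between drivers, and the report does not say which fix is intended.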