[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764791#comment-17764791 ]
ASF GitHub Bot commented on NUTCH-3000:
---------------------------------------

tballison merged PR #773:
URL: https://github.com/apache/nutch/pull/773

> protocol-selenium returns only the body,strips off the <head/> element
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-3000
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3000
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>            Reporter: Tim Allison
>            Priority: Major
>
> The selenium protocol returns only the body portion of the html, which means
> that neither the title nor the other page metadata in the <head/> section
> gets extracted.
> {noformat}
> String innerHtml = driver.findElement(By.tagName("body"))
>           .getAttribute("innerHTML");
> {noformat}
> We should return the full html, no?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
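
To illustrate the symptom reported in the issue, here is a minimal, stdlib-only Java sketch. It is not Nutch or Selenium code: the class HeadLossDemo and its helpers bodyInnerHtml and extractTitle are hypothetical names made up for this example, and the regexes are a stand-in for a real HTML parser. It shows that once only the inner HTML of <body> is kept, the title in <head/> can no longer be recovered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadLossDemo {

    // Hypothetical helpers for illustration only; not part of Nutch or Selenium.
    // Regexes are a simplification and are not a substitute for an HTML parser.
    private static final Pattern BODY = Pattern.compile(
        "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Mimics keeping only the inner HTML of the body element. */
    static String bodyInnerHtml(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    /** Returns the text of the title element, or null if there is none. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String full = "<html><head><title>Example</title>"
            + "<meta name=\"description\" content=\"demo\"></head>"
            + "<body><p>Hello</p></body></html>";

        // What a body-only fetch would hand to downstream parsing.
        String bodyOnly = bodyInnerHtml(full);

        System.out.println("title from full html: " + extractTitle(full));
        System.out.println("title from body only: " + extractTitle(bodyOnly));
    }
}
```

Running main prints "Example" for the full document and "null" for the body-only string. On the Selenium side, WebDriver.getPageSource() is one existing call that returns the source of the whole page rather than a single element; whether PR #773 uses that call is not stated in this message.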