[ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-1918.
----------------------------------
Resolution: Fixed
Committed revision 1655966.
Thanks Seb
> TikaParser specifies a default namespace when generating DOM
> ------------------------------------------------------------
>
> Key: NUTCH-1918
> URL: https://issues.apache.org/jira/browse/NUTCH-1918
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Reporter: Julien Nioche
> Fix For: 1.10
>
> Attachments: NUTCH-1918.patch
>
>
> The DOM generated by parse-tika differs from the one produced by parse-html.
> Ideally we should be able to use either parser with the same XPath
> expressions.
> This is related to [NUTCH-1592], but this time, instead of being a matter of
> letter case, the problem comes from the namespace used.
> This issue has been investigated and fixed in storm-crawler
> [https://github.com/DigitalPebble/storm-crawler/pull/58].
> Here is what Guillaume explained there:
> bq. When parsing the content, Tika creates a properly formatted XHTML
> document: all elements are created within the XHTML namespace.
> bq. However, in XPath 1.0 there is no concept of a default namespace, so XPath
> expressions such as //BODY don't match anything. To make this work we
> should use //ns1:BODY and define a NamespaceContext which associates ns1 with
> "http://www.w3.org/1999/xhtml"
> bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which
> is our SAX handler used to convert the SAX events into a DOM tree, to ignore
> a "default namespace"; the ParserBolt initializes it with the XHTML
> namespace. This way //BODY matches.
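
For anyone who wants to reproduce the behaviour described above, here is a minimal sketch using only the JDK's DOM and XPath APIs. It is not the code from NUTCH-1918.patch or from the storm-crawler pull request. The class name XPathNamespaceDemo, the helpers build() and count(), and the Ns1Context class are hypothetical names made up for this illustration. Passing null as the namespace to build() stands in for a DOM builder that ignores the default namespace.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of Nutch, Tika or storm-crawler.
public class XPathNamespaceDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Builds <HTML><BODY/></HTML>. With ns == XHTML_NS the elements live in the
    // XHTML namespace; with ns == null they are in no namespace at all.
    static Document build(String ns) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(ns, "HTML");
            Element body = doc.createElementNS(ns, "BODY");
            doc.appendChild(html);
            html.appendChild(body);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes matched by expr, optionally binding the prefix ns1
    // to the XHTML namespace.
    static int count(Document doc, String expr, boolean bindNs1) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindNs1) {
                xpath.setNamespaceContext(new Ns1Context());
            }
            NodeList nodes =
                    (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Maps the (arbitrary) prefix ns1 to the XHTML namespace URI.
    static final class Ns1Context implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "ns1" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                    ? Collections.singletonList("ns1").iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    public static void main(String[] args) {
        Document namespaced = build(XHTML_NS);
        Document plain = build(null);
        // Unprefixed names in XPath 1.0 only match elements in no namespace.
        System.out.println(count(namespaced, "//BODY", false));    // 0
        System.out.println(count(namespaced, "//ns1:BODY", true)); // 1
        System.out.println(count(plain, "//BODY", false));         // 1
    }
}
```

The first two calls correspond to the NamespaceContext workaround from the quoted explanation. The third corresponds to dropping the namespace at DOM construction time, which keeps the same XPath expressions usable against a non-namespaced DOM.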