Julien Nioche created NUTCH-1918:
------------------------------------
Summary: TikaParser specifies a default namespace when generating DOM
Key: NUTCH-1918
URL: https://issues.apache.org/jira/browse/NUTCH-1918
Project: Nutch
Issue Type: Bug
Components: parser
Reporter: Julien Nioche
Fix For: 1.10
The DOM generated by parse-tika differs from the one produced by parse-html.
Ideally, we should be able to use either parser with the same XPath expressions.
This is related to [NUTCH-1592], but this time the problem comes from the
namespace used rather than from letter case.
This issue has been investigated and fixed in storm-crawler
[https://github.com/DigitalPebble/storm-crawler/pull/58].
Here is what Guillaume explained there:
bq. When parsing the content, Tika creates a properly formatted XHTML document:
all elements are created within the XHTML namespace.
bq. However, XPath 1.0 has no concept of a default namespace, so XPath
expressions such as //BODY don't match anything. To make this work we should
use //ns1:BODY and define a NamespaceContext which associates ns1 with
"http://www.w3.org/1999/xhtml".
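The mismatch described in the quote can be reproduced with the JDK alone. The following is a minimal sketch, not code from Nutch or storm-crawler: the class name `XhtmlXPathDemo` and its helper methods are hypothetical, and the DOM is built by hand with uppercase element names in the XHTML namespace to stand in for what parse-tika is described as producing.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; not part of Nutch or storm-crawler.
class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Builds <HTML><BODY/></HTML> with every element in the XHTML namespace.
    static Document buildNamespacedDom() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();
        Element html = doc.createElementNS(XHTML_NS, "HTML");
        Element body = doc.createElementNS(XHTML_NS, "BODY");
        doc.appendChild(html);
        html.appendChild(body);
        return doc;
    }

    // Counts the nodes matched by expr, optionally binding the prefix ns1 to XHTML.
    static int count(Document doc, String expr, boolean bindNs1) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        if (bindNs1) {
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String namespaceURI) {
                    return XHTML_NS.equals(namespaceURI) ? "ns1" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    String prefix = getPrefix(namespaceURI);
                    return prefix == null
                            ? Collections.<String>emptyList().iterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });
        }
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildNamespacedDom();
        // Unprefixed name tests only match elements that are in no namespace.
        System.out.println("//BODY matches: " + count(doc, "//BODY", false));
        // With ns1 bound to the XHTML namespace, the prefixed test matches.
        System.out.println("//ns1:BODY matches: " + count(doc, "//ns1:BODY", true));
    }
}
```

The cost of this approach is that every XPath expression has to carry the prefix, which is what makes it impossible to share expressions between parse-tika and parse-html.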
bq. To keep the XPath expressions simpler, I modified the DOMBuilder, which is
the SAX handler we use to convert the SAX events into a DOM tree, so that it
ignores a "default namespace", and the ParserBolt initializes it with the XHTML
namespace. This way //BODY matches.
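The idea of a DOM-building SAX handler that ignores one configured namespace can be sketched as follows. This is an illustration under stated assumptions, not the actual DOMBuilder from Nutch or storm-crawler: the class `NamespaceIgnoringDomBuilder` and its `countMatches` helper are hypothetical, attributes are skipped for brevity, and a small hand-written XHTML string stands in for Tika's SAX output.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; the real DOMBuilder is considerably more complete.
class NamespaceIgnoringDomBuilder extends DefaultHandler {
    private final Document doc;
    private final String ignoredNamespace; // may be null: nothing is ignored
    private final Deque<Node> stack = new ArrayDeque<>();

    NamespaceIgnoringDomBuilder(Document doc, String ignoredNamespace) {
        this.doc = doc;
        this.ignoredNamespace = ignoredNamespace;
        stack.push(doc);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element element;
        if (uri == null || uri.isEmpty() || uri.equals(ignoredNamespace)) {
            // Elements in the ignored namespace are created in no namespace.
            element = doc.createElementNS(null, localName);
        } else {
            // Any other namespace is preserved as reported by the parser.
            element = doc.createElementNS(uri, qName);
        }
        stack.peek().appendChild(element);
        stack.push(element);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        Node parent = stack.peek();
        if (parent instanceof Element) {
            parent.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    // Parses xml through the handler and counts the nodes matched by expr.
    static int countMatches(String xml, String ignoredNamespace, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        parserFactory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new NamespaceIgnoringDomBuilder(doc, ignoredNamespace));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xhtmlNs = "http://www.w3.org/1999/xhtml";
        String xml = "<HTML xmlns=\"" + xhtmlNs + "\"><BODY><P>hi</P></BODY></HTML>";
        System.out.println("ignoring XHTML namespace: " + countMatches(xml, xhtmlNs, "//BODY"));
        System.out.println("keeping XHTML namespace: " + countMatches(xml, null, "//BODY"));
    }
}
```

Stripping the namespace at DOM-construction time keeps the XPath expressions free of prefixes, at the price of a DOM that no longer records that its elements were XHTML.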