[jira] [Resolved] (NUTCH-1918) TikaParser specifies a default namespace when generating DOM

Julien Nioche (JIRA) Fri, 30 Jan 2015 01:08:16 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Julien Nioche resolved NUTCH-1918.
----------------------------------
    Resolution: Fixed

Committed revision 1655966.

Thanks Seb

> TikaParser specifies a default namespace when generating DOM
> ------------------------------------------------------------
>
>                 Key: NUTCH-1918
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1918
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>             Fix For: 1.10
>
>         Attachments: NUTCH-1918.patch
>
>
> The DOM generated by parse-tika differs from the one generated by parse-html. 
> Ideally we should be able to use either parser with the same XPath 
> expressions.
> This is related to [NUTCH-1592], but this time the problem comes from the 
> namespace used rather than from uppercasing. 
> This issue has been investigated and fixed in storm-crawler 
> [https://github.com/DigitalPebble/storm-crawler/pull/58].
> Here is what Guillaume explained there:
> bq. When parsing the content, Tika creates a properly formatted XHTML 
> document: all elements are created within the namespace XHTML.
> bq. However, in XPath 1.0 there's no concept of a default namespace, so XPath 
> expressions such as //BODY don't match anything. To make this work we 
> should use //ns1:BODY and define a NamespaceContext which associates ns1 with 
> "http://www.w3.org/1999/xhtml".
> bq. To keep the XPathExpressions simpler, I modified the DOMBuilder which is 
> our SaxHandler used to convert the SAX Events into a DOM tree to ignore a 
> "default name space" and the ParserBolt initializes it with the XHTML 
> namespace. This way //BODY matches.

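To illustrate the XPath behaviour Guillaume describes, here is a minimal sketch using only the JDK's javax.xml APIs. It is not code from the NUTCH-1918 patch or from storm-crawler. The class name, the helper methods and the two hand-written XML strings are assumptions made for the example; the strings only stand in for a DOM whose elements are in the XHTML namespace and for one whose elements are in no namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Nutch or storm-crawler.
class XhtmlNamespaceXPathDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hand-written stand-ins (assumptions), not real parser output.
    static final String NAMESPACED =
        "<HTML xmlns=\"" + XHTML_NS + "\"><BODY><P>hi</P></BODY></HTML>";
    static final String PLAIN = "<HTML><BODY><P>hi</P></BODY></HTML>";

    // Binds the prefix "ns1" to the XHTML namespace, as suggested in the quote.
    static final NamespaceContext NS1_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "ns1".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "ns1" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("ns1").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Parses the XML string into a namespace-aware DOM.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the number of nodes the XPath 1.0 expression selects.
    static int count(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(NS1_CONTEXT);
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, parse(xml), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count(NAMESPACED, "//BODY"));     // prints 0
        // With the prefix bound to the XHTML namespace, the element is found.
        System.out.println(count(NAMESPACED, "//ns1:BODY")); // prints 1
        // Without a namespace on the elements, the simple expression works.
        System.out.println(count(PLAIN, "//BODY"));          // prints 1
    }
}
```

The last case corresponds to the outcome the issue aims for: once the DOM elements carry no namespace, the same unprefixed expressions can be used regardless of which parser built the DOM.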


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
