[
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884818#comment-13884818
]
Sebastian Nagel edited comment on NUTCH-1253 at 1/28/14 10:57 PM:
------------------------------------------------------------------
Hi [~lewismc], the HTML that fails to parse does not look really incorrect: both
{{<a name="..."/>}} and {{<iframe src="..."/>}} are empty XML-style tags
(bachelor tags).
According to Neko's [Change
History|http://nekohtml.sourceforge.net/changes.html], the configuration feature
"allow-selfclosing-iframe" was introduced in v1.9.15. If the feature is set to
true, the problematic document is parsed successfully.
The attached patch sets "allow-selfclosing-iframe" for both parse-html (parse
plugin and test) and parse-tika (test only). Tests now pass.
Note: the patch also contains the changes related to the Neko upgrade, but its
debug output must still be removed.
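To illustrate why the two tags count as empty XML-style tags, here is a minimal sketch that uses only the JDK's own XML parser. It does not exercise Neko or the "allow-selfclosing-iframe" feature; the class name and the sample snippet are hypothetical and only modeled on the tags mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Nutch or Neko.
class EmptyTagDemo {

    // Assumed sample markup: self-closing <a> and <iframe> next to a normal <p>.
    static final String SNIPPET =
        "<div><a name=\"top\"/><iframe src=\"frame.html\"/><p>text</p></div>";

    // Parses the string with the JDK's XML parser (strict XML rules, not HTML rules).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // True if the first element with the given tag name exists and has no child nodes.
    static boolean isEmptyElement(Document doc, String tag) {
        Element e = (Element) doc.getElementsByTagName(tag).item(0);
        return e != null && !e.hasChildNodes();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SNIPPET);
        System.out.println("a empty:      " + isEmptyElement(doc, "a"));
        System.out.println("iframe empty: " + isEmptyElement(doc, "iframe"));
        System.out.println("p empty:      " + isEmptyElement(doc, "p"));
    }
}
```

An XML parser accepts both self-closing tags as elements without content. An HTML parser applies different rules, which is where a switch such as the Neko feature comes in.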
> Incompatible neko and xerces versions
> -------------------------------------
>
> Key: NUTCH-1253
> URL: https://issues.apache.org/jira/browse/NUTCH-1253
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Environment: Ubuntu 10.04
> Reporter: Dennis Spathis
> Assignee: Lewis John McGibbney
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch,
> NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch,
> TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt,
> TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt,
> nutch1253parsed.html, nutch1253test.html
>
>
> The Nutch 1.4 distribution includes
> - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
> - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser
> (configured to use neko) is invoked during a local-mode crawl, the parse
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError,
> rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
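A likely reason the error stays invisible without the extra clause is that AbstractMethodError is a java.lang.Error, so a handler written for Exception does not match it. The sketch below simulates this with a hypothetical stand-in method. It is not the Nutch HtmlParser code, and how the real getParse method handles exceptions is an assumption here.

```java
// Hypothetical demo class; it only simulates the failure described in the issue.
class ThrowableCatchDemo {

    // Stand-in for a parser call that hits an incompatible library version.
    static void parseWithIncompatibleLibrary() {
        throw new AbstractMethodError("simulated incompatibility");
    }

    // Handles only Exception: the Error passes through the inner catch untouched.
    static String catchExceptionOnly() {
        try {
            try {
                parseWithIncompatibleLibrary();
                return "no error";
            } catch (Exception e) {
                return "caught by catch(Exception): " + e.getClass().getName();
            }
        } catch (Error escaped) {
            // Reached because AbstractMethodError is not a subclass of Exception.
            return "escaped catch(Exception): " + escaped.getClass().getName();
        }
    }

    // Handles Throwable: the Error is caught and can be logged with its stack trace.
    static String catchThrowable() {
        try {
            parseWithIncompatibleLibrary();
            return "no error";
        } catch (Throwable t) {
            return "caught by catch(Throwable): " + t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(catchExceptionOnly());
        System.out.println(catchThrowable());
    }
}
```

Catching Throwable is used here for diagnosis only. In production code it is usually better to log and rethrow than to swallow an Error.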
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> <plugin
>    id="lib-nekohtml"
>    name="CyberNeko HTML Parser"
>    version="1.9.11"
>    provider-name="org.cyberneko">
>    <runtime>
>       <library name="nekohtml-0.9.5.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
> </plugin>
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
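A mismatch like this can be detected mechanically. The following is a minimal sketch, assuming the JAR file name ends in "-<version>.jar". The class and method names are hypothetical, and the descriptor is the one quoted above, embedded as a string; nothing here is part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical checker class.
class PluginVersionCheck {

    // The plugin descriptor as quoted in the issue.
    static final String PLUGIN_XML =
        "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
      + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
      + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
      + "<export name=\"*\"/></library></runtime></plugin>";

    // Extracts "0.9.5" from "nekohtml-0.9.5.jar"; returns null if no version suffix is found.
    static String jarVersion(String jarName) {
        Matcher m = Pattern.compile("-(\\d+(?:\\.\\d+)*)\\.jar$").matcher(jarName);
        return m.find() ? m.group(1) : null;
    }

    // Compares the declared plugin version with the version in the first library's file name.
    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element plugin = doc.getDocumentElement();
        String declared = plugin.getAttribute("version");
        Element library = (Element) plugin.getElementsByTagName("library").item(0);
        String bundled = jarVersion(library.getAttribute("name"));
        return declared.equals(bundled)
            ? "consistent: " + declared
            : "mismatch: declared " + declared + ", bundled " + bundled;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(PLUGIN_XML)); // prints "mismatch: declared 1.9.11, bundled 0.9.5"
    }
}
```

Such a check only compares names. It cannot tell whether two JARs are binary compatible, which is the actual problem reported here.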
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)