[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-03-11 Thread Talat UYARER (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Talat UYARER updated NUTCH-1253:


Attachment: NUTCH-1253-2.x-eclipse.patch

[~icebergx5] is right. At present, the 2.x branch does not work with Eclipse. 
Eclipse reports "Missing required library" for neko 0.9.5. I think [~lewismc] 
forgot to add the nekohtml dependency to the eclipse target in build.xml. I have 
created a bugfix patch for 2.x. 


 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-eclipse.patch, NUTCH-1253-2.x-v2.patch, 
 NUTCH-1253-nutchgora.patch, NUTCH-1253-trunk.patch, 
 NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 nutch1253parsed.html, nutch1253test.html


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a
 catch(Throwable) clause in the getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
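
The debugging tip above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual HtmlParser code: the class name, `parseWithIncompatibleJars`, and `describeFailure` are hypothetical, and the error is simulated. The point is only that `AbstractMethodError` is an `Error`, so a `catch (Exception)` clause never sees it, while `catch (Throwable)` does.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ThrowableLoggingSketch {

    // Hypothetical stand-in for the parser call; it only simulates the
    // linkage error that incompatible JARs would raise at runtime.
    static void parseWithIncompatibleJars() {
        throw new AbstractMethodError("simulated neko/xerces mismatch");
    }

    // Runs the call and reports which catch clause, if any, handled the failure.
    static String describeFailure(Runnable parseCall) {
        try {
            parseCall.run();
            return "parse succeeded";
        } catch (Exception e) {
            // Not reached for AbstractMethodError, which is not an Exception.
            return "caught Exception: " + e;
        } catch (Throwable t) {
            // Errors end up here; capture the stack trace so it can be logged.
            StringWriter trace = new StringWriter();
            t.printStackTrace(new PrintWriter(trace));
            return "caught Throwable: " + trace;
        }
    }

    public static void main(String[] args) {
        System.out.println(
            describeFailure(ThrowableLoggingSketch::parseWithIncompatibleJars));
    }
}
```

In a real plugin the captured trace would go to the plugin's logger rather than to standard output.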
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
       <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
       </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (version tag is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?
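
A mismatch like the one described above could be detected mechanically. The following is a minimal sketch using only the JDK's XML parser. The class and method names are hypothetical, and the rule that a library JAR name should end in `-<version>.jar` is an assumption made for illustration, not a Nutch convention.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PluginVersionCheck {

    // Returns the library JAR names that do not end in "-<version>.jar",
    // where <version> is the version attribute of the root plugin element.
    static List<String> mismatchedLibraries(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    pluginXml.getBytes(StandardCharsets.UTF_8)));
            String version = doc.getDocumentElement().getAttribute("version");
            NodeList libraries = doc.getElementsByTagName("library");
            List<String> mismatches = new ArrayList<>();
            for (int i = 0; i < libraries.getLength(); i++) {
                String jar = ((Element) libraries.item(i)).getAttribute("name");
                if (!jar.endsWith("-" + version + ".jar")) {
                    mismatches.add(jar);
                }
            }
            return mismatches;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        // Descriptor content as quoted in the issue description.
        String xml = "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
            + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
            + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
            + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println(mismatchedLibraries(xml)); // prints [nekohtml-0.9.5.jar]
    }
}
```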



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-28 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1253:
---

Attachment: NUTCH-1253-trunk.v2.patch

Hi [~lewismc], the HTML which fails to parse does not really look incorrect: both 
{{<a name=.../>}} and {{<iframe src=.../>}} are empty XML-style tags 
(bachelor tags).
According to Neko's [Change 
History|http://nekohtml.sourceforge.net/changes.html], a configuration feature 
allow-selfclosing-iframe was introduced in v1.9.15. If the feature is set to 
true, the problematic document is parsed successfully.
The attached patch adds allow-selfclosing-iframe for both parse-html (parse 
plugin and test) and parse-tika (test only). Tests now pass. 
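
To illustrate what "empty XML-style tag" means here, the sketch below parses a small fragment with the JDK's XML parser (not NekoHTML, and without the allow-selfclosing-iframe feature) and counts child nodes. The fragment in `SAMPLE` and all names are hypothetical, and the fragment is not the attached test document. Under XML rules, the self-closed `a` and `iframe` elements have no children, and the following content stays a sibling.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EmptyTagSketch {

    // Hypothetical fragment with two self-closed elements followed by content.
    static final String SAMPLE =
        "<body><a name=\"bottom\"/><iframe src=\"iframe.html\"/>"
        + "<p>rest of page</p></body>";

    // Number of direct child nodes of the first element named tagName.
    static int childCount(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node element = doc.getElementsByTagName(tagName).item(0);
            return element.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("a: " + childCount(SAMPLE, "a"));           // a: 0
        System.out.println("iframe: " + childCount(SAMPLE, "iframe")); // iframe: 0
        System.out.println("body: " + childCount(SAMPLE, "body"));     // body: 3
    }
}
```

An HTML parser that instead treats the self-closed tag as an opening tag would make the rest of the fragment a child of that element.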

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253-trunk.patch, NUTCH-1253-trunk.v2.patch, NUTCH-1253.patch, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 nutch1253parsed.html, nutch1253test.html




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-27 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Attachment: NUTCH-1253-trunk.patch

Patch for trunk.
The lapse in our testing code was a combination of the closing </a> and the 
iframe tag, which needed to be closed as well.
I've made the changes to TestDOMContentUtils in parse-html and parse-tika, 
updated all plugin and ivy configuration, and also formatted TestDOMContentUtils 
as per the Nutch code formatting.
This was a pain to track down but eventually we got there.
Thanks if anyone can review.

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253-trunk.patch, NUTCH-1253.patch, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 nutch1253parsed.html, nutch1253test.html




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Attachment: TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt

Actually, I can confirm that this upgrade seems to break tests in 
TestDOMContentUtils (see attached). It seems that the document fragment 
contains some markup... which is incorrect. Having sat in Eclipse for ages 
debugging this, I am now over on the nekohtml user list trying to sort this 
out. If you guys have any ideas then please chip in.
I am not sure whether it's the way we use Xerces2, Neko, or maybe a bug in 
DOMContentUtils, but there is undesired behavior anyway.

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2014-01-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1253:
---

Attachment: nutch1253test.html
nutch1253parsed.html

It's likely a regression in NekoHTML: {{<a name=bottom/>}} erroneously 
encloses the rest of the document, including {{</body></html>}}, which is 
then interpreted as textual content. See the attached document (taken from the 
failed unit test) and the output of parse-html using Neko 1.9.15/19.

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt, 
 nutch1253parsed.html, nutch1253test.html




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2013-05-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1253:
---

Fix Version/s: 1.8

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
Assignee: Lewis John McGibbney
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1253-2.x-v2.patch, NUTCH-1253-nutchgora.patch, 
 NUTCH-1253.patch, TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Fix Version/s: 2.2
   1.7

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Fix For: 1.7, 2.2

 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Attachment: NUTCH-1253-nutchgora.patch
NUTCH-1253.patch

Trivial patches for both trunk and the Nutchgora branch. Can you guys please test 
and get back on this issue? Thanks.

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Patch Info: Patch Available

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch

