[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422757#comment-16422757 ]

Hudson commented on NUTCH-2545:
---
SUCCESS: Integrated in Jenkins build Nutch-trunk #3514 (See [https://builds.apache.org/job/Nutch-trunk/3514/])

NUTCH-2545 Upgrade to Any23 2.2 (lewis.mcgibbney: [https://github.com/apache/nutch/commit/5233a7993d619f55ec2b9149f57ef6d43f0a5453])
* (edit) src/plugin/any23/plugin.xml
* (edit) src/plugin/any23/howto_upgrade_any23.txt
* (edit) src/plugin/any23/ivy.xml
* (edit) ivy/ivysettings.xml

NUTCH-2545 Revert syntax correction to original implementation, add (lewis.mcgibbney: [https://github.com/apache/nutch/commit/40e92a54ebb5fe9e8c0ecbe9e04456d1c299703f])
* (edit) src/plugin/any23/plugin.xml
* (edit) src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
* (edit) src/plugin/any23/ivy.xml

NUTCH-2545 Upgrade to Any23 2.2 (lewis.mcgibbney: [https://github.com/apache/nutch/commit/d7e8a2661cd5bcc358f9e7a2511cf0450f9bc493])
* (edit) src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
* (edit) src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java

> Upgrade to Any23 2.2
>
> Key: NUTCH-2545
> URL: https://issues.apache.org/jira/browse/NUTCH-2545
> Project: Nutch
> Issue Type: Improvement
> Components: any23, plugin
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.15
>
> We recently released Any23 2.2. I would like to update the Any23 plugin to
> this newest version.
> PR coming up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422687#comment-16422687 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377965899

@HansBrende I removed the previous SAX logic because of the concerns raised above. We can revisit this when we release Any23 2.3.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422684#comment-16422684 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc closed pull request #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/ivy/ivysettings.xml b/ivy/ivysettings.xml
index e4ef4831d..d9b504400 100644
--- a/ivy/ivysettings.xml
+++ b/ivy/ivysettings.xml
@@ -34,9 +34,6 @@
 https://repository.apache.org/content/repositories/snapshots/; override="false"/>
- http://svn.apache.org/repos/asf/any23/repo-ext/;
-override="false"/>
-
-
@@ -86,7 +75,6 @@
-
[XML markup in the hunks above was stripped in extraction; only the fragments shown survive]

diff --git a/src/plugin/any23/howto_upgrade_any23.txt b/src/plugin/any23/howto_upgrade_any23.txt
index 45eb92e6c..cf0d07703 100644
--- a/src/plugin/any23/howto_upgrade_any23.txt
+++ b/src/plugin/any23/howto_upgrade_any23.txt
@@ -1,8 +1,6 @@
-1. Upgrade Any23 dependency in trunk/ivy/ivy.xml
+1. Upgrade Any223 dependency in src/plugin/any23/ivy.xml
-2. Upgrade Any223 dependency in src/plugin/any23/ivy.xml
-
-3. Upgrade Any23's own dependencies in src/plugin/any23/plugin.xml
+2. Upgrade Any23's own dependencies in src/plugin/any23/plugin.xml
 To get the list of dependencies and their versions execute:
 $ ant -f ./build-ivy.xml
- $ ls lib/
+ $ ls lib | sed 's/^/ /g'

diff --git a/src/plugin/any23/ivy.xml b/src/plugin/any23/ivy.xml
index 4b526e26f..0e65d931d 100644
--- a/src/plugin/any23/ivy.xml
+++ b/src/plugin/any23/ivy.xml
@@ -36,13 +36,14 @@
[hunk body lost in extraction: one removed and two added XML lines]

diff --git a/src/plugin/any23/plugin.xml b/src/plugin/any23/plugin.xml
index 3b099cd2b..71c552215 100644
--- a/src/plugin/any23/plugin.xml
+++ b/src/plugin/any23/plugin.xml
@@ -25,173 +25,162 @@
[hunk body lost in extraction: the removed and added XML lines were stripped, leaving only their -/+ markers]

diff --git a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
index e64131046..6cfd4211a 100644
--- a/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
+++ b/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
@@ -18,8 +18,6 @@
 import java.io.ByteArrayOutputStream;
 import java.io.IOException;
-import java.io.OutputStreamWriter;
-import
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420214#comment-16420214 ]

ASF GitHub Bot commented on NUTCH-2545:
---
HansBrende commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377457712

@lewismc that's fine, but beware: the SAX pre-processor actually screws up a lot of triples! In the microdata test, that 39th triple extracted with the SAX pre-processor is actually a *bug* (see ANY23-340). Also, the SAX pre-processor removes all the namespaces specified in the html element. Which means that, even though the BBC test file specified the namespaces xmlns:og="http://opengraphprotocol.org/schema/" and xmlns:rnews="http://iptc.org/std/rNews/2011-10-07#", the extractors ignored those, defaulting to "http://ogp.me/ns#" for the "og" namespace, and "rnews:" for the "rnews" namespace.
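The namespace fallback described in this comment can be illustrated with a minimal sketch. This is not Any23 or Nutch code: the class `PrefixResolutionSketch`, the `expand` method, the `DEFAULTS` table, and the property names `title` and `headline` are all hypothetical, chosen only to mirror the behavior reported above (a declared prefix wins; once the declarations are stripped, "og" falls back to a built-in default and "rnews" stays unexpanded).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of prefix resolution; not taken from Any23. */
final class PrefixResolutionSketch {

    // Assumed fallback table: only "og" has a built-in default, as reported in the comment.
    static final Map<String, String> DEFAULTS = Map.of("og", "http://ogp.me/ns#");

    /**
     * Expands "prefix:local" using the declared namespaces first, then the
     * defaults. A term whose prefix is unknown is returned unchanged.
     */
    static String expand(String term, Map<String, String> declared) {
        int colon = term.indexOf(':');
        if (colon < 0) {
            return term;
        }
        String prefix = term.substring(0, colon);
        String local = term.substring(colon + 1);
        String namespace = declared.get(prefix);
        if (namespace == null) {
            namespace = DEFAULTS.get(prefix);
        }
        return namespace == null ? term : namespace + local;
    }

    public static void main(String[] args) {
        // Namespaces as declared on the html element of the BBC test file.
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("og", "http://opengraphprotocol.org/schema/");
        declared.put("rnews", "http://iptc.org/std/rNews/2011-10-07#");

        // Declarations present: the document's own namespace is used.
        System.out.println(expand("og:title", declared));
        // prints http://opengraphprotocol.org/schema/title

        // Declarations stripped (as after the SAX pre-processor): defaults apply.
        System.out.println(expand("og:title", Map.of()));
        // prints http://ogp.me/ns#title
        System.out.println(expand("rnews:headline", Map.of()));
        // prints rnews:headline
    }
}
```

The point of the sketch is only that the same input yields different predicate URIs depending on whether the declarations survive pre-processing, which is one way triple counts and contents can differ between runs.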
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420202#comment-16420202 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377456238

Boom. For the time being we can upgrade to Any23 2.2. When the upgrade to 2.3 comes, we can remove the conditional logic.
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420194#comment-16420194 ]

ASF GitHub Bot commented on NUTCH-2545:
---
HansBrende commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377455182

@lewismc I've figured out all the details of the bugs. Documented here in [ANY23-340](https://issues.apache.org/jira/projects/ANY23/issues/ANY23-340).
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419674#comment-16419674 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377356544

ACK
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419671#comment-16419671 ]

ASF GitHub Bot commented on NUTCH-2545:
---
HansBrende commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377356007

@lewismc Hmm, well, I guess we still have some bugs in Any23 then. I will look into this.
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419640#comment-16419640 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377348461

@HansBrende I removed the syntax improvement logic as this 'should' now be fixed in Any23 2.2... However, I am skeptical as to whether this is actually the case.

```
Testcase: extractMicroDataFromHTML took 5.19 sec
	FAILED
We expect 40 tab-separated triples extracted by the filter expected:<39> but was:<38>
junit.framework.AssertionFailedError: We expect 40 tab-separated triples extracted by the filter expected:<39> but was:<38>
	at org.apache.nutch.any23.TestAny23ParseFilter.extractMicroDataFromHTML(TestAny23ParseFilter.java:90)

Testcase: ignoreUnsupported took 1.286 sec
Testcase: testExtractTriplesFromHTML took 1.158 sec
	FAILED
We expect 117 tab-separated triples extracted by the filter expected:<79> but was:<68>
junit.framework.AssertionFailedError: We expect 117 tab-separated triples extracted by the filter expected:<79> but was:<68>
	at org.apache.nutch.any23.TestAny23ParseFilter.testExtractTriplesFromHTML(TestAny23ParseFilter.java:82)
```
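A count mismatch like the one in this log says how many triples changed but not which ones. One way to find out, sketched below under stated assumptions, is to diff the two sets of extracted statements line by line. The class `TripleDiffSketch`, its methods, and the example.org statements are hypothetical; the sketch assumes each run's output is available as text with one statement per line, which is not something the thread confirms about the Nutch test.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for comparing two line-oriented triple dumps. */
final class TripleDiffSketch {

    /** Returns the non-blank lines that occur in {@code before} but not in {@code after}. */
    static Set<String> onlyIn(String before, String after) {
        Set<String> result = toLines(before);
        result.removeAll(toLines(after));
        return result;
    }

    /** Splits text on any line break, trims each line, and drops blank lines. */
    static Set<String> toLines(String text) {
        Set<String> lines = new TreeSet<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up statements standing in for the output of two extraction runs.
        String withPreprocessor =
            "<http://example.org/a> <http://example.org/p> \"1\" .\n"
          + "<http://example.org/a> <http://example.org/q> \"2\" .\n";
        String withoutPreprocessor =
            "<http://example.org/a> <http://example.org/p> \"1\" .\n";

        // Lists the statements that disappeared when the pre-processor was removed.
        for (String statement : onlyIn(withPreprocessor, withoutPreprocessor)) {
            System.out.println(statement);
        }
    }
}
```

Running `onlyIn` in both directions separates statements that were dropped from statements that merely changed form, for example because a predicate was expanded against a different namespace.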
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419585#comment-16419585 ]

ASF GitHub Bot commented on NUTCH-2545:
---
HansBrende commented on issue #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306#issuecomment-377336781

@lewismc Yes, the SAX logic can be removed! I would also like to see the jsonldjava dependency force-upgraded to 0.11.2 (or whatever the version after 0.11.1 will be called), after @ansell merges my fix for [ANY23-336](https://issues.apache.org/jira/browse/ANY23-336). This issue could obviously cause a crawler a lot of pain.
[jira] [Commented] (NUTCH-2545) Upgrade to Any23 2.2
[ https://issues.apache.org/jira/browse/NUTCH-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416157#comment-16416157 ]

ASF GitHub Bot commented on NUTCH-2545:
---
lewismc opened a new pull request #306: NUTCH-2545 Upgrade to Any23 2.2
URL: https://github.com/apache/nutch/pull/306

This PR addresses https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2545 and passes all unit tests. I have a feeling we can take the upgrade further and remove the [SAX fix logic](https://github.com/apache/nutch/blob/master/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L111-L116) previously implemented.