[jira] [Comment Edited] (NUTCH-2512) Nutch does not build under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503269#comment-16503269 ]

Ralf edited comment on NUTCH-2512 at 6/6/18 1:45 PM:
-----------------------------------------------------

I just compiled master/trunk on a VM-Box with Ubuntu Bionic and Oracle Java 10.1. It throws a couple of warnings, but compiles, and I have it doing a small crawl right now; so far so good.

Nutch no longer takes the Solr URL from the command line; this should be reflected in the tutorials and docs by the time 1.15 gets released.

(I still can't compile Nutch with Tika 1.18 on my Java 8 set-up; it works when I revert to Tika 1.17. I wonder what could be wrong with my Java set-up.)

Correction - actually it doesn't index to Solr and fails with:

        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.178.20:8983/solr/#/nutch: Expected mime type application/octet-stream but got text/html.

Error 405 HTTP method POST is not supported by this URL
HTTP ERROR 405
Problem accessing /solr/index.html. Reason:
    HTTP method POST is not supported by this URL

was (Author: bl4ck1c3):

I just compiled master/trunk on a VM-Box with Ubuntu Bionic and Oracle Java 10.1. It throws a couple of warnings, but compiles, and I have it doing a small crawl right now; so far so good.

Nutch no longer takes the Solr URL from the command line; this should be reflected in the tutorials and docs by the time 1.15 gets released.

(I still can't compile Nutch with Tika 1.18 on my Java 8 set-up; it works when I revert to Tika 1.17. I wonder what could be wrong with my Java set-up.)

> Nutch does not build under JDK9
> -------------------------------
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
> Issue Type: Bug
> Components: build, injector
> Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>
> When trying to build Nutch, Ant complains about missing Sonar files, then exits with:
>
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>
> Once the "offending code" is commented out, the build finishes, but the resulting binary fails to function (as does the Apache-compiled binary distribution). Both exit with:
>
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
>
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
> at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at
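[Editor's note on the 405 above: a URL of the form http://host:8983/solr/#/nutch is the browser-side admin UI address; the "#/nutch" fragment is never sent to the server, so SolrJ's POST lands on /solr/ and gets the admin HTML handler back, hence "HTTP method POST is not supported" and the text/html mime type. The client URL should point at the core itself. A minimal sketch, assuming the solr.server.url property is read from conf/nutch-site.xml and that the core is actually named "nutch" (both assumptions for this setup):]

```xml
<!-- conf/nutch-site.xml (sketch): point the Solr index writer at the core
     itself, not at the admin UI path. The core name "nutch" is an assumption. -->
<property>
  <name>solr.server.url</name>
  <value>http://192.168.178.20:8983/solr/nutch</value>
</property>
```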
[jira] [Commented] (NUTCH-2512) Nutch does not build under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503269#comment-16503269 ]

Ralf commented on NUTCH-2512:
-----------------------------

I just compiled master/trunk on a VM-Box with Ubuntu Bionic and Oracle Java 10.1. It throws a couple of warnings, but compiles, and I have it doing a small crawl right now; so far so good.

Nutch no longer takes the Solr URL from the command line; this should be reflected in the tutorials and docs by the time 1.15 gets released.

(I still can't compile Nutch with Tika 1.18 on my Java 8 set-up; it works when I revert to Tika 1.17. I wonder what could be wrong with my Java set-up.)

> Nutch does not build under JDK9
> -------------------------------
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
> Issue Type: Bug
> Components: build, injector
> Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>
> When trying to build Nutch, Ant complains about missing Sonar files, then exits with:
>
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>
> Once the "offending code" is commented out, the build finishes, but the resulting binary fails to function (as does the Apache-compiled binary distribution). Both exit with:
>
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
>
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
> at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
> at org.apache.nutch.crawl.Injector.run(Injector.java:563)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>
> Error running:
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
>
> Failed with exit value 255.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18
[ https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491041#comment-16491041 ]

Ralf commented on NUTCH-2584:
-----------------------------

Hi,

Just tried it; I still get the same error at compile time.

> Upgrade parse-tika to use Tika 1.18
> -----------------------------------
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.15
>
> Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
> See [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18
[ https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489311#comment-16489311 ]

Ralf commented on NUTCH-2584:
-----------------------------

Hi,

Tried this... for me it does not work. The compile exits with:

[ivy:resolve] ERRORS
[ivy:resolve] impossible to get artifacts when data has not been loaded. IvyNode = javax.measure#unit-api;1.0

> Upgrade parse-tika to use Tika 1.18
> -----------------------------------
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.15
>
> Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
> See [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
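[Editor's note: Ivy's "impossible to get artifacts when data has not been loaded" usually means resolution of the named module failed earlier, often because of a stale or partially downloaded entry in the local ~/.ivy2 cache, rather than a bad ivy.xml; javax.measure#unit-api;1.0 is a transitive dependency pulled in by the Tika 1.18 parsers. Two common workarounds, both assumptions about this particular failure: delete the cached module (~/.ivy2/cache/javax.measure) and re-run the build, or pin the module explicitly in Nutch's ivy/ivy.xml, sketched below with the rev taken from the error message:]

```xml
<!-- ivy/ivy.xml (sketch): pin the transitive dependency named in the
     Ivy error explicitly; rev 1.0 is taken from the error message. -->
<dependency org="javax.measure" name="unit-api" rev="1.0" conf="*->default"/>
```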
[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies
[ https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ralf updated NUTCH-2583:
------------------------

Description:

Hi,

It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.

I've attached an ivy.xml with the latest possible dependencies that do not break the compile. I've tested it with a few runs of the "crawl script"; so far it seems to work: it generates, it fetches, it parses, it indexes to Solr. Raising any of these dependencies further breaks the compile.

PS: I haven't touched any of the Hadoop stuff, and I don't remember whether I touched the testing part or not.

was:

Hi,

It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.

I've attached an ivy.xml with the latest possible dependencies that do not break the compile. I've tested it with a few runs of the "crawl script"; so far it seems to work: it generates, it fetches, it parses, it indexes to Solr.

> Upgrading Nutch's dependencies
> ------------------------------
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.14
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
> Hi,
>
> It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.
> I've attached an ivy.xml with the latest possible dependencies that do not break the compile. I've tested it with a few runs of the "crawl script"; so far it seems to work: it generates, it fetches, it parses, it indexes to Solr. Raising any of these dependencies further breaks the compile.
>
> PS: I haven't touched any of the Hadoop stuff, and I don't remember whether I touched the testing part or not.
[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies
[ https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ralf updated NUTCH-2583:
------------------------

Description:

Hi,

It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.

I've attached an ivy.xml with the latest possible dependencies that do not break the compile. I've tested it with a few runs of the "crawl script"; so far it seems to work: it generates, it fetches, it parses, it indexes to Solr.

was:

Hi,

It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.

> Upgrading Nutch's dependencies
> ------------------------------
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.14
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
> Hi,
>
> It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.
> I've attached an ivy.xml with the latest possible dependencies that do not break the compile. I've tested it with a few runs of the "crawl script"; so far it seems to work: it generates, it fetches, it parses, it indexes to Solr.
[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies
[ https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ralf updated NUTCH-2583:
------------------------

Attachment: ivy.xml

> Upgrading Nutch's dependencies
> ------------------------------
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Affects Versions: 1.14
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
> Hi,
>
> It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.
[jira] [Created] (NUTCH-2583) Upgrading Nutch's dependencies
Ralf created NUTCH-2583:
------------------------

Summary: Upgrading Nutch's dependencies
Key: NUTCH-2583
URL: https://issues.apache.org/jira/browse/NUTCH-2583
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.14
Reporter: Ralf
Fix For: 1.15

Hi,

It would be nice to be able to upgrade all of Nutch's dependencies to the latest versions available.
[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries
[ https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488984#comment-16488984 ]

Ralf commented on NUTCH-2290:
-----------------------------

I've got an ivy.xml with updated dependencies, as far up as possible without breaking the compile. I don't know about the rest, but so far it seemed to work on a few trial runs with the crawl script.

> Update licenses of bundled libraries
> ------------------------------------
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
> Issue Type: Bug
> Components: deployment
> Affects Versions: 2.3.1, 1.12
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should be updated to include all licenses of dependencies (and their dependencies) in accordance with [Assembling LICENSE and NOTICE HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing HOWTO
> # bundled libraries are referenced with path and version number, e.g. {{lib/icu4j-4_0_1.jar}}. This would require updating the LICENSE.txt with every dependency upgrade. A more generic reference ("ICU4J") would be easier to maintain, but the HOWTO requires to "specify the version of the dependency as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, according to the HOWTO there is no need to repeat the Apache license again and again.
[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9
[ https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484547#comment-16484547 ]

Ralf commented on NUTCH-2512:
-----------------------------

Hi,

I'm really curious... I've been taking apart the 1.14 source these last few days. How do you update to a new Java version?

> Nutch 1.14 does not work under JDK9
> -----------------------------------
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
> Issue Type: Bug
> Components: build, injector
> Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
> Reporter: Ralf
> Priority: Major
> Fix For: 1.15
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>
> When trying to build Nutch, Ant complains about missing Sonar files, then exits with:
>
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>
> Once the "offending code" is commented out, the build finishes, but the resulting binary fails to function (as does the Apache-compiled binary distribution). Both exit with:
>
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
>
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
> at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
> at org.apache.nutch.crawl.Injector.run(Injector.java:563)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>
> Error running:
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
>
> Failed with exit value 255.
[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries
[ https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484521#comment-16484521 ]

Ralf commented on NUTCH-2290:
-----------------------------

Hi,

Shouldn't we upgrade ALL dependencies first? There are some that are very old.

> Update licenses of bundled libraries
> ------------------------------------
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
> Issue Type: Bug
> Components: deployment
> Affects Versions: 2.3.1, 1.12
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.15
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should be updated to include all licenses of dependencies (and their dependencies) in accordance with [Assembling LICENSE and NOTICE HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing HOWTO
> # bundled libraries are referenced with path and version number, e.g. {{lib/icu4j-4_0_1.jar}}. This would require updating the LICENSE.txt with every dependency upgrade. A more generic reference ("ICU4J") would be easier to maintain, but the HOWTO requires to "specify the version of the dependency as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, according to the HOWTO there is no need to repeat the Apache license again and again.
[jira] [Created] (NUTCH-2512) Nutch 1.14 does not work under JDK9
Ralf created NUTCH-2512:
------------------------

Summary: Nutch 1.14 does not work under JDK9
Key: NUTCH-2512
URL: https://issues.apache.org/jira/browse/NUTCH-2512
Project: Nutch
Issue Type: Bug
Components: build, injector
Affects Versions: 1.14
Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
Reporter: Ralf

Nutch 1.14 (Source) does not compile properly under JDK 9
Nutch 1.14 (Binary) does not function under Java 9

When trying to build Nutch, Ant complains about missing Sonar files, then exits with:

"BUILD FAILED
/home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "

Once the "offending code" is commented out, the build finishes, but the resulting binary fails to function (as does the Apache-compiled binary distribution). Both exit with:

Injecting seed URLs
/home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
Injector: starting at 2018-02-21 02:02:16
Injector: crawlDb: searchcrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Injector: java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
at org.apache.nutch.crawl.Injector.run(Injector.java:563)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:528)

Error running:
/home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/

Failed with exit value 255.
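[Editor's note: the "Unparseable date" failure is consistent with JDK 9 switching its default locale data from the legacy JRE tables to CLDR (JEP 252). Ant parses a bare datetime attribute with the platform SHORT date/time format, and CLDR's short format expects a comma ("1/25/71, 2:00 PM"), so the old literal no longer parses. Giving the format an explicit pattern and locale sidesteps the platform default; the same idea in plain Java, as a sketch:]

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Sketch: parsing the build.xml timestamp with an explicit pattern and
// locale behaves identically on JDK 8 and JDK 9+, regardless of whether
// the JVM's default locale data comes from the legacy tables or CLDR.
public class DateParseDemo {
    public static Date parseBuildStamp(String stamp) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy h:mm a", Locale.US);
        return fmt.parse(stamp); // am/pm matching is case-insensitive, so "pm" is fine
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseBuildStamp("01/25/1971 2:00 pm"));
    }
}
```

[The build.xml equivalent would be Ant's Touch task with its explicit pattern attribute, e.g. pattern="MM/dd/yyyy h:mm a" alongside the datetime attribute, so the parse no longer depends on the JDK's locale default.]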
[jira] [Created] (NUTCH-1773) Solr Indexer fails
Ralf created NUTCH-1773:
------------------------

Summary: Solr Indexer fails
Key: NUTCH-1773
URL: https://issues.apache.org/jira/browse/NUTCH-1773
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.3
Environment: Ubuntu 12.04 LTS, java version 1.7.0_55 - Hbase-0.90.6 (pseudo dist), Hadoop 1.2.1, Solr 4.6
Reporter: Ralf
Priority: Critical
Fix For: 2.3

When using the crawl script, or solrindexer by itself (/bin/nutch solrindex), in local mode it fails with:

hduser@bl4ck1c3:~/nutch-2.3/runtime/local$ bin/nutch solrindex TestCrawl18 -reindex
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication

SolrIndexerJob: java.lang.IllegalStateException: Target host must not be null, or set in parameters.
at org.apache.http.impl.client.DefaultRequestDirector.determineRoute(DefaultRequestDirector.java:787)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:414)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:393)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:127)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:171)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:196)

When using the new INDEX command it finishes, but nothing is added to Solr:

hduser@bl4ck1c3:~/nutch-2.3/runtime/local$ bin/nutch index TestCrawl18 -reindex
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication

Log shows:

2014-05-13 03:01:13,781 INFO indexer.IndexingJob - IndexingJob: starting
2014-05-13 03:01:14,108 INFO indexer.IndexingFilters - Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
2014-05-13 03:01:14,109 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-05-13 03:01:14,109 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-05-13 03:01:14,335 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-05-13 03:01:14,336 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-05-13 03:01:14,336 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-05-13 03:01:14,620 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:14,768 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:14,968 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:15,243 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:15,276 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:15,326 WARN zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
2014-05-13 03:01:15,386 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-05-13
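[Editor's note: the IllegalStateException "Target host must not be null, or set in parameters" from HttpClient generally means the URL handed to SOLRIndexWriter was empty or not absolute. The command above passes only the crawl id, so the mandatory solr.server.url listed in the usage text has to come from configuration. A minimal sketch, assuming conf/nutch-site.xml is the place this installation reads it from and that the core is named "nutch" (host, port, and core name are all assumptions for illustration):]

```xml
<!-- conf/nutch-site.xml (sketch): give SOLRIndexWriter an absolute URL.
     Host, port, and core name ("nutch") are assumptions for illustration. -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/nutch</value>
</property>
```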
[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=13993566#comment-13993566 ]

Ralf commented on NUTCH-1714:
-----------------------------

Did something change regarding the Solr indexer? First it said that no indexer was defined; then, after I added the Solr indexer to the plugins, it says something to the effect of: "solr url must not be null, or set in parameters". Still the same message when hardcoding the Solr URL in nutch-site.xml.

Nutch 2.x upgrade to Gora 0.4
-----------------------------

Key: NUTCH-1714
URL: https://issues.apache.org/jira/browse/NUTCH-1714
Project: Nutch
Issue Type: Improvement
Reporter: Alparslan Avcı
Assignee: Alparslan Avcı
Fix For: 2.3
Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch

Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=13992728#comment-13992728 ]

Ralf commented on NUTCH-1679:
-----------------------------

Hi,

I would love to participate. How can I check out the 2.3 code so I can test? Thank you!

UpdateDb using batchId, link may override crawled page.
-------------------------------------------------------

Key: NUTCH-1679
URL: https://issues.apache.org/jira/browse/NUTCH-1679
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
Fix For: 2.3
Attachments: NUTCH-1679.patch

The problem is in the HBase store; not sure about other stores.
Suppose at the first crawl cycle we crawl link A, then get an outlink B.
In the second cycle we crawl link B, which also has a link pointing to A.
In the second updatedb we load only page B from the store, and will add A as a new link, because it doesn't know A already exists in the store, and will override A.
UpdateDb must be run without batchId, or we must set additionsAllowed=false.

Here is the code for a new page:

page = new WebPage();
schedule.initializeSchedule(url, page);
page.setStatus(CrawlStatus.STATUS_UNFETCHED);
try {
  scoringFilters.initialScore(url, page);
} catch (ScoringFilterException e) {
  page.setScore(0.0f);
}

The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY].
- i think we can change something here so that the new page will only update one column, for example 'link', and if it is really a new page, we can initialize all the above fields in the generator
- or we add a checkAndPut operator to the store, so when adding a new page we first check whether it already exists
[jira] [Commented] (NUTCH-1770) Nutch is failing to parse all PDFs
[ https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993766#comment-13993766 ]

Ralf commented on NUTCH-1770:
-
I just compiled the 2.x branch; no problems parsing PDFs here.

Nutch is failing to parse all PDFs
-
Key: NUTCH-1770
URL: https://issues.apache.org/jira/browse/NUTCH-1770
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.3
Environment: FreeBSD 10, OpenJDK 8
Reporter: Rogério Pereira Araújo
Priority: Critical
Fix For: 2.3

I'm trying to crawl a filesystem directory containing several PDFs, but when the parsing stage starts, I'm getting the error described in ticket PDFBOX-1122.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13992939#comment-13992939 ]

Ralf commented on NUTCH-1714:
-
OK, what do I have to do in order to use Gora 0.4? Which version of HBase? 0.94.19?

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993547#comment-13993547 ]

Ralf commented on NUTCH-1679:
-
Checked out revision 1593523.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993545#comment-13993545 ]

Ralf commented on NUTCH-1679:
-
OK, I got it. I guess that whatever is downloaded there has patches applied, except those from the open issues.

--
This message was sent by Atlassian JIRA (v6.2#6252)