[jira] [Comment Edited] (NUTCH-2512) Nutch does not build under JDK9

2018-06-06 Thread Ralf (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503269#comment-16503269
 ] 

Ralf edited comment on NUTCH-2512 at 6/6/18 1:45 PM:
-

I just compiled master/trunk on a VirtualBox VM with Ubuntu Bionic and Oracle Java 
10.1 - it throws a couple of warnings, but compiles, and I have it doing a small 
crawl right now; so far so good. Nutch now no longer takes the Solr URL from 
the command line; this should be reflected in the tutorials and docs by the time 
1.15 gets released. (I still can't compile Nutch with Tika 1.18 on my Java 8 
setup; it works when I revert to Tika 1.17. I wonder what could be wrong with 
my Java setup)...

 

Correction - actually it doesn't index to Solr; it fails with:

at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://192.168.178.20:8983/solr/#/nutch: Expected mime type 
application/octet-stream but got text/html. 


Error 405 HTTP method POST is not supported by this URL

HTTP ERROR 405
Problem accessing /solr/index.html. Reason:
 HTTP method POST is not supported by this URL
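A plausible explanation (my note, not confirmed in the thread): everything after the "#" in http://192.168.178.20:8983/solr/#/nutch is a URL fragment, which a client never sends to the server. The POST therefore lands on /solr/ itself, which does not accept POST, hence the 405. A small sketch showing how the fragment disappears (the core name "nutch" in the second URL is an assumption taken from the fragment):

```java
import java.net.URI;

public class SolrUrlCheck {
    public static void main(String[] args) {
        // The admin-UI style URL from the error above: "#/nutch" is a fragment.
        URI adminUi = URI.create("http://192.168.178.20:8983/solr/#/nutch");
        System.out.println(adminUi.getPath());     // /solr/  (what the server sees)
        System.out.println(adminUi.getFragment()); // /nutch  (never sent to the server)

        // A plain core URL (hypothetical core name "nutch") keeps the core in the path.
        URI core = URI.create("http://192.168.178.20:8983/solr/nutch");
        System.out.println(core.getPath());        // /solr/nutch
    }
}
```

So an indexer pointed at the admin-UI address effectively posts to the bare /solr/ path.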



 


was (Author: bl4ck1c3):
I just compiled master/trunk on a VirtualBox VM with Ubuntu Bionic and Oracle Java 
10.1 - it throws a couple of warnings, but compiles, and I have it doing a small 
crawl right now; so far so good. Nutch now no longer takes the Solr URL from 
the command line; this should be reflected in the tutorials and docs by the time 
1.15 gets released. (I still can't compile Nutch with Tika 1.18 on my Java 8 
setup; it works when I revert to Tika 1.17. I wonder what could be wrong with 
my Java setup)

> Nutch does not build under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> Once having commented out the "offending code" the Build finishes but the 
> resulting Binary fails to function (as well as the Apache Compiled Binary 
> distribution), Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at 

[jira] [Commented] (NUTCH-2512) Nutch does not build under JDK9

2018-06-06 Thread Ralf (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503269#comment-16503269
 ] 

Ralf commented on NUTCH-2512:
-

I just compiled master/trunk on a VirtualBox VM with Ubuntu Bionic and Oracle Java 
10.1 - it throws a couple of warnings, but compiles, and I have it doing a small 
crawl right now; so far so good. Nutch now no longer takes the Solr URL from 
the command line; this should be reflected in the tutorials and docs by the time 
1.15 gets released. (I still can't compile Nutch with Tika 1.18 on my Java 8 
setup; it works when I revert to Tika 1.17. I wonder what could be wrong with 
my Java setup)

> Nutch does not build under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> Once having commented out the "offending code" the Build finishes but the 
> resulting Binary fails to function (as well as the Apache Compiled Binary 
> distribution), Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18

2018-05-25 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491041#comment-16491041
 ] 

Ralf commented on NUTCH-2584:
-

Hi,

 

Just tried it, still get the same error at compile time.

 

 

> Upgrade parse-tika to use Tika 1.18
> ---
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
> See 
> [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
>  





[jira] [Commented] (NUTCH-2584) Upgrade parse-tika to use Tika 1.18

2018-05-24 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489311#comment-16489311
 ] 

Ralf commented on NUTCH-2584:
-

Hi,

Tried this... for me it does not work. The build exits with:

[ivy:resolve]  ERRORS
[ivy:resolve] impossible to get artifacts when data has not been loaded. 
IvyNode = javax.measure#unit-api;1.0

> Upgrade parse-tika to use Tika 1.18
> ---
>
> Key: NUTCH-2584
> URL: https://issues.apache.org/jira/browse/NUTCH-2584
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> Tika 1.18 is released and NUTCH-2583 includes an upgrade of tika-core.
> See 
> [howto_upgrade_tika|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt].
>  





[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Description: 
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 
Increasing any of these dependencies breaks the compile.

 

PS: I haven't touched any of the Hadoop stuff and don't remember if I touched 
the testing part or not.

  was:
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 


> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
> I've attached an Ivy.xml with the latest possible dependencies without 
> breaking the compile. I've tested it with a few runs of the "crawl script", 
> so far it seems to work, it generates, it fetches, it parses, it indexes to 
> Solr. Increasing any of these dependencies breaks the compile.
>  
> PS: I haven't touched any of the Hadoop stuff and don't remember if I touched 
> the testing part or not.





[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Description: 
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

I've attached an Ivy.xml with the latest possible dependencies without breaking 
the compile. I've tested it with a few runs of the "crawl script", so far it 
seems to work, it generates, it fetches, it parses, it indexes to Solr. 

  was:
Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

 


> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
> I've attached an Ivy.xml with the latest possible dependencies without 
> breaking the compile. I've tested it with a few runs of the "crawl script", 
> so far it seems to work, it generates, it fetches, it parses, it indexes to 
> Solr. 





[jira] [Updated] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ralf updated NUTCH-2583:

Attachment: ivy.xml

> Upgrading Nutch's dependencies
> --
>
> Key: NUTCH-2583
> URL: https://issues.apache.org/jira/browse/NUTCH-2583
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.14
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
> Attachments: ivy.xml
>
>
> Hi,
>  
> It would be nice to be able to upgrade all of Nutch's dependencies to the 
> latest possible available.
>  





[jira] [Created] (NUTCH-2583) Upgrading Nutch's dependencies

2018-05-24 Thread Ralf (JIRA)
Ralf created NUTCH-2583:
---

 Summary: Upgrading Nutch's dependencies
 Key: NUTCH-2583
 URL: https://issues.apache.org/jira/browse/NUTCH-2583
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.14
Reporter: Ralf
 Fix For: 1.15


Hi,

 

It would be nice to be able to upgrade all of Nutch's dependencies to the 
latest possible available.

 





[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries

2018-05-24 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488984#comment-16488984
 ] 

Ralf commented on NUTCH-2290:
-

I've got an ivy.xml with updated dependencies, as far up as possible without 
breaking the compile. I don't know about the rest, but so far it has seemed to 
work on a few trial runs with the crawl script.

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance to [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g 
> {{lib/icu4j-4_0_1.jar}}. This would require to update the LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain but the HOWTO requires to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.





[jira] [Commented] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-05-22 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484547#comment-16484547
 ] 

Ralf commented on NUTCH-2512:
-

Hi,

 

I'm really curious... I've been taking apart the 1.14 source these last few 
days.

How do you update to a new Java version?

> Nutch 1.14 does not work under JDK9
> ---
>
> Key: NUTCH-2512
> URL: https://issues.apache.org/jira/browse/NUTCH-2512
> Project: Nutch
>  Issue Type: Bug
>  Components: build, injector
>Affects Versions: 1.14
> Environment: Ubuntu 16.04 (All patches up to 02/20/2018)
> Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
>Reporter: Ralf
>Priority: Major
> Fix For: 1.15
>
>
> Nutch 1.14 (Source) does not compile properly under JDK 9
> Nutch 1.14 (Binary) does not function under Java 9
>  
> When trying to build Nutch, Ant complains about missing Sonar files, then 
> exits with:
> "BUILD FAILED
> /home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
>  
> Once having commented out the "offending code" the Build finishes but the 
> resulting Binary fails to function (as well as the Apache Compiled Binary 
> distribution), Both exit with:
>  
> Injecting seed URLs
> /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Injector: starting at 2018-02-21 02:02:16
> Injector: crawlDb: searchcrawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> (file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
> sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Injector: java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
>         at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
>         at 
> org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>         at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>         at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>         at java.base/java.security.AccessController.doPrivileged(Native 
> Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:563)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:528)
>  
> Error running:
>   /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
> Failed with exit value 255.





[jira] [Commented] (NUTCH-2290) Update licenses of bundled libraries

2018-05-22 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484521#comment-16484521
 ] 

Ralf commented on NUTCH-2290:
-

Hi,

 

Shouldn't we upgrade ALL dependencies first? There are some that are very old.

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.15
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance to [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g 
> {{lib/icu4j-4_0_1.jar}}. This would require to update the LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain but the HOWTO requires to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO there is no need to repeat the Apache license again 
> and again.





[jira] [Created] (NUTCH-2512) Nutch 1.14 does not work under JDK9

2018-02-22 Thread Ralf (JIRA)
Ralf created NUTCH-2512:
---

 Summary: Nutch 1.14 does not work under JDK9
 Key: NUTCH-2512
 URL: https://issues.apache.org/jira/browse/NUTCH-2512
 Project: Nutch
  Issue Type: Bug
  Components: build, injector
Affects Versions: 1.14
 Environment: Ubuntu 16.04 (All patches up to 02/20/2018)

Oracle Java 9 - Oracle JDK 9 (Latest as of 02/22/2018)
Reporter: Ralf


Nutch 1.14 (Source) does not compile properly under JDK 9

Nutch 1.14 (Binary) does not function under Java 9

 

When trying to build Nutch, Ant complains about missing Sonar files, then exits 
with:
"BUILD FAILED
/home/nutch/nutch/build.xml:79: Unparseable date: "01/25/1971 2:00 pm" "
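For what it's worth, this failure pattern is consistent with JDK 9's switch to CLDR locale data, which changed the default short date/time format that Ant uses to parse literals like the one in build.xml:79. A workaround sketch (an illustration of the locale issue, not the actual Nutch fix): parsing the same literal with an explicit pattern works independently of the JDK's locale data:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class DateParseCheck {
    public static void main(String[] args) throws Exception {
        // The literal from build.xml:79, parsed with an explicit pattern
        // instead of the locale-dependent default format that changed in JDK 9.
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy h:mm a", Locale.ENGLISH);
        Calendar cal = Calendar.getInstance();
        cal.setTime(fmt.parse("01/25/1971 2:00 pm"));
        System.out.println(cal.get(Calendar.YEAR));        // 1971
        System.out.println(cal.get(Calendar.HOUR_OF_DAY)); // 14
    }
}
```

In Ant terms, the equivalent workaround would be giving the task an explicit pattern rather than relying on the default locale format.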
 
Once having commented out the "offending code" the Build finishes but the 
resulting Binary fails to function (as well as the Apache Compiled Binary 
distribution), Both exit with:
 
Injecting seed URLs
/home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
Injector: starting at 2018-02-21 02:02:16
Injector: crawlDb: searchcrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by 
org.apache.hadoop.security.authentication.util.KerberosUtil 
(file:/home/nutch/nutch2/lib/hadoop-auth-2.7.4.jar) to method 
sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of 
org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Injector: java.lang.NullPointerException
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
        at 
org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
        at org.apache.nutch.crawl.Injector.run(Injector.java:563)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:528)
 
Error running:
  /home/nutch/nutch2/bin/nutch inject searchcrawl//crawldb urls/
Failed with exit value 255.





[jira] [Created] (NUTCH-1773) Solr Indexer fails

2014-05-15 Thread Ralf (JIRA)
Ralf created NUTCH-1773:
---

 Summary: Solr Indexer fails
 Key: NUTCH-1773
 URL: https://issues.apache.org/jira/browse/NUTCH-1773
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.3
 Environment: Ubuntu 12.04 LTS, java version 1.7.0_55 - Hbase-0.90.6 
(pseudo dist), Hadoop 1.2.1, Solr 4.6
Reporter: Ralf
Priority: Critical
 Fix For: 2.3


When using the crawl script or the Solr indexer by itself (bin/nutch solrindex) 
in local mode, it fails with:

hduser@bl4ck1c3:~/nutch-2.3/runtime/local$ bin/nutch solrindex TestCrawl18 
-reindex
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication


SolrIndexerJob: java.lang.IllegalStateException: Target host must not be null, 
or set in parameters.
at 
org.apache.http.impl.client.DefaultRequestDirector.determineRoute(DefaultRequestDirector.java:787)
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:414)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:393)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:146)
at 
org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:146)
at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:127)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:171)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:196)
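"Target host must not be null" is the usual symptom of the Solr URL not reaching the index writer, either as a command-line argument to solrindex or via the solr.server.url property. A minimal nutch-site.xml sketch (the URL is a placeholder, not taken from this report, and by itself may not resolve this particular issue):

```xml
<property>
  <name>solr.server.url</name>
  <!-- Placeholder; point this at your actual Solr instance/core. -->
  <value>http://localhost:8983/solr/</value>
</property>
```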

When using the new index command, it finishes, but nothing is added to Solr:

hduser@bl4ck1c3:~/nutch-2.3/runtime/local$ bin/nutch index TestCrawl18 -reindex
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
 
Log shows:

2014-05-13 03:01:13,781 INFO  indexer.IndexingJob - IndexingJob: starting
2014-05-13 03:01:14,108 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.analysis.lang.LanguageIndexingFilter
2014-05-13 03:01:14,109 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-05-13 03:01:14,109 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-05-13 03:01:14,335 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-05-13 03:01:14,336 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-05-13 03:01:14,336 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-05-13 03:01:14,620 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:14,768 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:14,968 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:15,243 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:15,276 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:15,326 WARN  zookeeper.ClientCnxnSocket - Connected to an old 
server; r-o mode will be unavailable
2014-05-13 03:01:15,386 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-05-13 

[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-15 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993566#comment-13993566
 ] 

Ralf commented on NUTCH-1714:
-

Did something change regarding the Solr indexer? First it said that no indexer 
was defined; then, when I added solrindexer to the plugins, it says something to 
the effect of: the Solr URL must not be null, or set in parameters... Still the 
same message when hardcoding the Solr URL in nutch-site.xml.

 Nutch 2.x upgrade to Gora 0.4
 -

 Key: NUTCH-1714
 URL: https://issues.apache.org/jira/browse/NUTCH-1714
 Project: Nutch
  Issue Type: Improvement
Reporter: Alparslan Avcı
Assignee: Alparslan Avcı
 Fix For: 2.3

 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
 NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch


 Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
 details in this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-12 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992728#comment-13992728
 ] 

Ralf commented on NUTCH-1679:
-

Hi,

I would love to participate, how can I check out the 2.3 code so I can test?

Thank you!

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem is in the HBase store; not sure about other stores.
 Suppose in the first crawl cycle we crawl link A, then get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store, and will add A as a new 
 link, because it doesn't know A already exists in the store, and so will override A.
 UpdateDb must be run without a batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page will override the old page's status, score, fetchTime, fetchInterval, 
 retries, and metadata[CASH_KEY].
 - I think we can change something here so that the new page only updates one 
 column, for example 'link'; if it is really a new page, we can initialize 
 all of the above fields in the generator.
 - Or we add a checkAndPut operator to the store, so that when adding a new page we 
 first check whether it already exists.
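
The check-and-put idea proposed above can be sketched store-agnostically. The
following is a hypothetical, in-memory stand-in for the HBase-backed page store
(a plain Java map instead of a real table; names such as addLinkOrCreate are
illustrative, not Nutch API): the page row is only initialized when it is absent,
so an inbound link can never clobber an existing page's status.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed checkAndPut guard for updatedb:
// initialize the full "new page" row only if no row exists yet;
// otherwise update only the link column.
public class UpdateDbSketch {

    // In-memory stand-in for the page store: url -> (column -> value).
    static final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();

    // Mimics checkAndPut: the page is initialized only when the row is absent.
    static void addLinkOrCreate(String url, String fromLink) {
        Map<String, String> page = store.computeIfAbsent(url, u -> {
            // Genuinely new page: safe to initialize status, score, etc. here.
            Map<String, String> fresh = new ConcurrentHashMap<>();
            fresh.put("status", "STATUS_UNFETCHED");
            fresh.put("score", "0.0");
            return fresh;
        });
        // Existing or new: only the link column is written unconditionally.
        page.put("link:" + fromLink, "1");
    }

    public static void main(String[] args) {
        // First cycle: A has been crawled and is marked fetched.
        store.put("A", new ConcurrentHashMap<>(Map.of("status", "STATUS_FETCHED")));
        // Second cycle: B links back to A; A's fetched status must survive.
        addLinkOrCreate("A", "B");
        System.out.println(store.get("A").get("status")); // STATUS_FETCHED preserved
    }
}
```

Against a real HBase 0.94 store the same guard would correspond conceptually to a
checkAndPut on the row key before writing the initialized page; the map-based
version above only demonstrates the invariant.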





[jira] [Commented] (NUTCH-1770) Nutch is failing to parse all PDFs

2014-05-12 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993766#comment-13993766
 ] 

Ralf commented on NUTCH-1770:
-

I just compiled the 2.x branch, no problems parsing PDFs here.

 Nutch is failing to parse all PDFs
 --

 Key: NUTCH-1770
 URL: https://issues.apache.org/jira/browse/NUTCH-1770
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.3
 Environment: FreeBSD 10, Open JDK 8
Reporter: Rogério Pereira Araújo
Priority: Critical
 Fix For: 2.3


 I'm trying to craw a filesystem directory containing several PDFs, but when 
 the parsing stage starts, I'm getting the error described on ticket 
 PDFBOX-1122





[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-11 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992939#comment-13992939
 ] 

Ralf commented on NUTCH-1714:
-

OK, what do I have to do in order to use Gora 0.4? Which version of HBase? 
0.94.19?



[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-11 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993547#comment-13993547
 ] 

Ralf commented on NUTCH-1679:
-

Checked out revision 1593523.



[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-10 Thread Ralf (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993545#comment-13993545
 ] 

Ralf commented on NUTCH-1679:
-

OK, I got it - I guess that whatever is downloaded there has patches applied 
except those from the open issues.
