[jira] [Commented] (NUTCH-2533) Injector: NullPointerException if seed URL dir contains non-file entries

Sebastian Nagel (JIRA) Wed, 11 Apr 2018 02:46:21 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433638#comment-16433638 ]


Sebastian Nagel commented on NUTCH-2533:
----------------------------------------

Nutch, or rather the Hadoop 
[FileInputFormat|http://hadoop.apache.org/docs/r2.8.2/api/index.html?org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html],
expects either a single file or a directory containing files as input for seed 
URLs. Reading input files recursively would allow too many potential 
errors, especially if the root directory {{/}} is passed as the seed directory: 
it contains many special files (devices, named pipes, etc.), and you are 
unlikely to have permission to read the entire directory tree. But agreed: the 
error message and stack trace aren't really informative for Nutch/Hadoop 
newcomers. I'll prepare a fix to improve the error message and log all non-file 
inputs in the seed directory. Thanks!
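
To illustrate the kind of check described above, here is a minimal sketch that lists the entries of a seed directory which are not regular files, so they can be logged before the job is submitted. It is not the actual Nutch patch: the class name SeedDirCheck and the method nonFileEntries are hypothetical, and it uses java.nio.file on a local path rather than Hadoop's FileSystem API, which is an assumption made to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Nutch or Hadoop. */
final class SeedDirCheck {

  /**
   * Returns the direct children of seedDir that are not regular files
   * (subdirectories, devices, named pipes, ...). If seedDir is itself a
   * regular file, the result is empty. Does not recurse.
   */
  static List<Path> nonFileEntries(Path seedDir) throws IOException {
    List<Path> nonFiles = new ArrayList<>();
    if (Files.isRegularFile(seedDir)) {
      return nonFiles; // a single seed file is valid input
    }
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(seedDir)) {
      for (Path entry : entries) {
        if (!Files.isRegularFile(entry)) {
          nonFiles.add(entry); // candidate for a warning in the log
        }
      }
    }
    return nonFiles;
  }
}
```

A caller could log each returned path and abort with a readable message when the list is non-empty, instead of letting the job fail later with a NullPointerException.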

> Injector: NullPointerException if seed URL dir contains non-file entries
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2533
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2533
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Krzysztof Madejski
>            Assignee: Sebastian Nagel
>            Priority: Blocker
>             Fix For: 2.4, 1.15
>
>
> I'm following https://wiki.apache.org/nutch/Nutch2Tutorial
>  
> I've run `./nutch inject /` and I've got the following error:
> {noformat}
> InjectorJob: starting at 2018-03-12 11:59:05
> InjectorJob: Injecting urlDir: /
> InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
> InjectorJob: java.lang.NullPointerException
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:442)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:411)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:115)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
> at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
> at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
