[ https://issues.apache.org/jira/browse/NUTCH-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433638#comment-16433638 ]
Sebastian Nagel commented on NUTCH-2533:
----------------------------------------

Nutch, or more precisely the Hadoop [FileInputFormat|http://hadoop.apache.org/docs/r2.8.2/api/index.html?org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html], expects either a single file or a directory containing files as input for seed URLs. Making it read input files recursively would invite too many potential errors, especially if the root directory {{/}} is passed as the seed directory: it contains many special files (devices, named pipes, etc.), and you are unlikely to have permission to read the entire directory tree.

But agreed: the error message, or rather the stack trace, isn't really informative for Nutch/Hadoop newbies. I'll prepare a fix to improve the error message and log all non-file inputs in the seed directory.

Thanks!

> Injector: NullPointerException if seed URL dir contains non-file entries
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2533
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2533
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Krzysztof Madejski
>            Assignee: Sebastian Nagel
>            Priority: Blocker
>             Fix For: 2.4, 1.15
>
>
> I'm following https://wiki.apache.org/nutch/Nutch2Tutorial
>
> I've run `./nutch inject /` and I've got the following error:
> {noformat}
> InjectorJob: starting at 2018-03-12 11:59:05
> InjectorJob: Injecting urlDir: /
> InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
> InjectorJob: java.lang.NullPointerException
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:442)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:411)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
>     at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:115)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70){noformat}
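
As a rough illustration of the check described in the comment (reporting non-file entries in the seed directory before the job is submitted), here is a minimal Java sketch. It is not the actual Nutch patch: the class name {{SeedDirCheck}} and the method {{findNonFileEntries}} are hypothetical, and it uses {{java.nio.file}} on the local filesystem only, whereas a real fix would presumably go through Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
class SeedDirCheck {

    /**
     * Returns the entries directly inside seedDir that are not regular files
     * (subdirectories, devices, named pipes, ...). Does not recurse.
     */
    static List<Path> findNonFileEntries(Path seedDir) throws IOException {
        List<Path> nonFiles = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(seedDir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    nonFiles.add(entry);
                }
            }
        }
        return nonFiles;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SeedDirCheck <seed file or directory>");
            System.exit(1);
        }
        Path seed = Paths.get(args[0]);
        if (Files.isRegularFile(seed)) {
            return; // a single seed file is valid input
        }
        // Log every entry that cannot be read as a seed file.
        for (Path entry : findNonFileEntries(seed)) {
            System.err.println("Not a regular file, cannot be used as seed input: " + entry);
        }
    }
}
```

Run against a seed directory, this prints one line per offending entry to stderr instead of failing later with an uninformative exception.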