What command are you using to crawl? Are you using bin/crawl, and/or
doing incremental crawling?
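For reference, a typical bin/crawl invocation from that era of Nutch looks roughly like the sketch below; the seed directory, Solr URL, and round count are placeholders, and the exact arguments differ between Nutch versions, so treat it only as an illustration:

    bin/crawl urls/ nsfacadis3Crawl http://localhost:8983/solr/ 2

(that is: seed dir, crawl dir, Solr URL used for indexing, number of crawl rounds).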

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Shuo Li <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, February 20, 2015 at 3:26 PM
To: "[email protected]" <[email protected]>
Subject: linkdb/current/part-00000/data does not exist

>Hi,
>
>
>I'm trying to crawl NSF ACADIS with nutch-selenium. I'm running into a
>problem where linkdb/current/part-00000/data does not exist. I checked my
>directory and files during crawling, and it appears this file sometimes
>exists and sometimes disappears, which is quite strange.
>
>
>Another problem is that when we crawl NSIDC ADE, we get a 403
>Forbidden error. Does this mean NSIDC ADE is blocking us?
>
>
>The log of the first error is at the bottom of this email. Any help would
>be appreciated.
>
>
>Regards,
>Shuo Li
>
>
>LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
>LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
>    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
>    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
>    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
>    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:415)
>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
>    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
>
