What command are you using to crawl? Are you using bin/crawl, and/or doing incremental crawling?
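For reference, a typical bin/crawl invocation looks roughly like the following; the exact arguments vary between Nutch releases, and the seed directory, crawl directory, Solr URL, and number of rounds here are placeholders, not values from your setup:

  bin/crawl urls/ nsfacadis3Crawl/ http://localhost:8983/solr/ 2

Knowing whether you run that script end-to-end or invoke the individual steps (inject, generate, fetch, parse, updatedb, invertlinks) yourself would help narrow down where the linkdb is getting into that state.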
Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Shuo Li <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, February 20, 2015 at 3:26 PM
To: "[email protected]" <[email protected]>
Subject: linkdb/current/part-00000/data does not exist

>Hi,
>
>I'm trying to crawl NSF ACADIS with nutch-selenium, and I'm running into
>a problem where linkdb/current/part-00000/data does not exist. I checked
>my directory and files during crawling, and it appears this file
>sometimes exists and sometimes disappears, which is quite strange.
>
>Another problem is that when we crawl NSIDC ADE, it gives us a 403
>Forbidden error. Does this mean NSIDC ADE is blocking us?
>
>The log of the first error is at the bottom of this email. Any help
>would be appreciated.
>
>Regards,
>Shuo Li
>
>
>LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
>LinkDb: java.io.FileNotFoundException: File
>file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
>at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
>at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
>at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
>at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:415)
>at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
>at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
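For reference, the step that fails in the trace above (LinkDb.invert merging with the existing linkdb) can also be run on its own with the standalone invertlinks command, which rebuilds the linkdb from the crawl's segments. A rough invocation against the crawl directory shown in the log would be the following; the segments path is assumed, so adjust it to wherever your crawl writes its segments:

  bin/nutch invertlinks nsfacadis3Crawl/linkdb -dir nsfacadis3Crawl/segments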

