Hi,

I was unable to reproduce the linkdb error.

The NSIDC ADE 403 Forbidden error occurs because NSIDC seems to be blocking User-Agent strings that contain "nutch".

--
Thanks,
Veeresh

On 20 February 2015 at 15:26, Shuo Li <[email protected]> wrote:

> Hi,
>
> I'm trying to crawl NSF ACADIS with nutch-selenium. I meet a problem *with
> linkdb/current/part-00000/data
> does not exist. *I checked my directory and my files during crawling, and
> it appears this file sometimes exist and sometimes disappear. This is quite
> weird and stranger.
>
> Another problem is when we crawl NSIDC ADE, it will give us a 403
> forbidden error. Does this mean NSIDC ADE is blocking us?
>
> The log of first error is in the bottom of this email. Any help would be
> appreciated.
>
> Regards,
> Shuo Li
>
>
> LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
> LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>     at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
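
P.S. One way to check the User-Agent explanation for the 403 is to request the same page with two different User-Agent strings and compare the status codes. The following is only a minimal sketch using the Python standard library; the function names `status_for_agent` and `compare_agents` are made up for illustration, and the URL and agent strings in the usage note are placeholders, not values taken from the actual crawl.

```python
import urllib.error
import urllib.request


def status_for_agent(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends back for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code


def compare_agents(url, agents, timeout=10):
    """Map each User-Agent string to the status code received for it."""
    return {agent: status_for_agent(url, agent, timeout) for agent in agents}
```

For example, `compare_agents("https://example.org/", ["Mozilla/5.0", "mycrawler/Nutch-1.9"])` (placeholder URL and agents) returns a dictionary of status codes. If only the agent containing "nutch" comes back as 403, that would be consistent with User-Agent filtering, and the `http.agent.*` properties in nutch-site.xml would be the place to look next.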

