This is a very interesting issue. I guess that the absence of parse_data means that no content was fetched for that segment. Am I wrong?

This has happened in my crawls a few times. Theoretically (I am guessing again), it can happen if every URL selected for fetching in an iteration is either blocked by the filters or fails to be fetched, for whatever reason. I got around the problem by checking for the presence of parse_data and, if it is absent, deleting the segment. This seems to work, but I am not 100% sure that it is a good thing to do. Can I do this? Is it safe? I would appreciate it if someone with expert knowledge commented on this issue.
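For what it's worth, here is a rough sketch of the check I run before the linkdb step. It assumes a local-filesystem crawl directory like the one in your log (the CRAWL_DIR variable is mine, not a Nutch convention); on HDFS you would use the hadoop fs equivalents instead of plain shell tests.

  #!/bin/sh
  # Delete any segment that has no parse_data subdirectory,
  # i.e. a segment in which nothing was actually fetched and parsed.
  CRAWL_DIR=crawl.blog                # adjust to your crawl directory
  for seg in "$CRAWL_DIR"/segments/*; do
    if [ ! -d "$seg/parse_data" ]; then
      echo "no parse_data in $seg - removing segment"
      rm -r "$seg"
    fi
  done

Run it after each fetch/parse cycle and the later LinkDb and index steps only ever see complete segments.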
Regards,
Arkadi

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> Sent: Saturday, July 25, 2009 12:54 AM
> To: nutch-user
> Subject: Why did my crawl fail?
>
> I installed nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> it was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
