Unfortunately I blew away those particular logs when I fetched the svn trunk. I just tried it again (well, I started it again at noon and it just finished), and this time it worked fine, so it seems like a heisenbug. Maybe it has something to do with encountering pages of types it can't handle?
On Mon, Jul 27, 2009 at 11:27 AM, xiao yang <[email protected]> wrote:
> Hi, Paul
>
> Can you post the error messages in the log file
> (file:/Users/ptomblin/nutch-1.0/logs)?
>
> On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin <[email protected]> wrote:
> > Actually, I got that error the first time I used it, and then again when I
> > blew away the downloaded nutch and grabbed the latest trunk from Subversion.
> >
> > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <[email protected]> wrote:
> >
> >> You must have crawled for several times, and some of them failed
> >> before the parse phase. So the parse data was not generated.
> >> You'd better delete the whole directory
> >> file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you
> >> will know the exact reason why it failed in the parse phase from the
> >> output information.
> >>
> >> Xiao
> >>
> >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin <[email protected]> wrote:
> >> > I installed nutch 1.0 on my laptop last night and set it running to crawl my
> >> > blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> >> > It was still running strong when I went to bed several hours later, and this
> >> > morning I woke up to this:
> >> >
> >> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> > -activeThreads=0
> >> > Fetcher: done
> >> > CrawlDb update: starting
> >> > CrawlDb update: db: crawl.blog/crawldb
> >> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> >> > CrawlDb update: additions allowed: true
> >> > CrawlDb update: URL normalizing: true
> >> > CrawlDb update: URL filtering: true
> >> > CrawlDb update: Merging segment data into db.
> >> > CrawlDb update: done
> >> > LinkDb: starting
> >> > LinkDb: linkdb: crawl.blog/linkdb
> >> > LinkDb: URL normalize: true
> >> > LinkDb: URL filter: true
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> >> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> > Input path does not exist:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> >> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >> >         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >> >         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >> >         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >> >         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> >> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> >> >
> >> >
> >> > --
> >> > http://www.linkedin.com/in/paultomblin
> >> >
> >> >
> >
> >
> >
> > --
> > http://www.linkedin.com/in/paultomblin
> >

--
http://www.linkedin.com/in/paultomblin
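For anyone hitting the same InvalidInputException: the LinkDb job reads parse_data from every directory under segments/, so a single segment left over from a crawl that died before the parse phase is enough to kill the whole run. A minimal sketch of checking for such segments before deleting anything (the crawl root here is a temp directory standing in for ~/nutch-1.0/crawl.blog, and the two segment names are taken from the log above purely for illustration):

```shell
# Simulate the crawl layout from the log: one segment that never got
# parsed, one that completed. In real use, point CRAWL at crawl.blog.
CRAWL="$(mktemp -d)/crawl.blog"
mkdir -p "$CRAWL/segments/20090723154530/crawl_fetch"  # died before parse: no parse_data
mkdir -p "$CRAWL/segments/20090724010303/parse_data"   # completed segment

# Any segment directory lacking a parse_data subdirectory is one the
# LinkDb job will trip over with InvalidInputException.
for seg in "$CRAWL"/segments/*/; do
    if [ ! -d "${seg}parse_data" ]; then
        echo "incomplete segment: ${seg}"
    fi
done
```

Deleting just the incomplete segment directories (or, as Xiao suggested, the whole crawl.blog directory) and recrawling is the usual way out; the loop only makes the failed segments visible before anything is removed.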
