Thank you, Thomas. That's a small change in 0.8 that I overlooked.
The Nutch crawl now progresses one step further.
But it still fails with an IOException, as shown below. Any further
insight?
(I re-ran the same command after removing the tmp directory and the
index directory,
but I hit the same exception.)
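(Concretely, the cleanup before the re-run was roughly the following;
/tmp/hadoop is the local Hadoop directory from the path in the exception
below, and test/thoreau-index is the directory passed to -dir:)

$ rm -rf /tmp/hadoop            # Hadoop's local map/reduce working files left by the previous run
$ rm -rf test/thoreau-index     # the crawl output directory, so the crawl starts from scratch
$ ./bin/nutch crawl test/urls -dir test/thoreau-index -depth 2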
-kuro
$ ./bin/nutch crawl test/urls -dir test/thoreau-index -depth 2 2>&1 | tee crawl-thoreau-060605-log.txt
060605 103451 Running job: job_yaocyb
060605 103451 C:/opt/nutch-060531/test/thoreau-index/crawldb/current/part-00000/data:0+125
060605 103451 C:/opt/nutch-060531/test/thoreau-index/segments/20060605103443/crawl_fetch/part-00000/data:0+141
060605 103451 C:/opt/nutch-060531/test/thoreau-index/segments/20060605103443/crawl_parse/part-00000:0+748
060605 103451 job_yaocyb
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_yv2ar3/map_ynynnj.out already exists
        at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
        at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
        at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)
Exception in thread "main"
> -----Original Message-----
> From: TDLN [mailto:[EMAIL PROTECTED]
> Sent: 2006-6-03 1:30
> To: [email protected]
> Subject: Re: help running 5/31 version of nightly build
>
> The syntax for the crawl command is
>
> Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
>
> So your first parameter should point to the *directory* containing the
> file with seed urls, not the file itself.
>
> Please fix your syntax and try again.
>
> Rgrds, Thomas
>
> On 6/3/06, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
> > I tried to run the May 31 version of the nightly build but it failed.
> > It has something to do with the "job", which I thought would not be
> > needed if I just need to run on a regular file system. Why does Nutch
> > try to use Hadoop in the default configuration? Is it necessary?
> >
> > -kuro
> >
> > $ ./bin/nutch crawl test/thoreau-url.txt -dir test/thoreau-index -depth 2
> > 060602 170942 parsing