Hi all,
I've been successfully using nutch trunk from the end of January. Since
I've been encountering errors (in dedup) when indexing huge segments, I
decided to sync to the head (as of today), hoping the current version
would solve my problems. Unfortunately, since the update I haven't been
able to run even a single crawl successfully.
Regarding my setup, I'm using 4 machines: 1 namenode/jobtracker and 3
datanodes/tasktrackers.
Here is what I did to transition to the latest nutch (as of today):
- Modified *conf/hadoop-env.sh* to reflect my environment
- Added my slaves in *conf/slaves*
- Moved all of the *ndfs*, *fs*, and *mapred* properties in
*conf/nutch-site.xml* to *conf/hadoop-site.xml* and renamed all *ndfs*
references to *dfs* (a sketch of the resulting file follows this list)
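For reference, here is roughly what my *conf/hadoop-site.xml* looks
like after the move. The hostnames, ports, and paths below are
illustrative placeholders, not necessarily the right values for
anyone else's setup:

<?xml version="1.0"?>
<configuration>
  <!-- placeholders: adjust hostnames, ports, and paths to your setup -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <!-- these were ndfs.name.dir / ndfs.data.dir in conf/nutch-site.xml -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/epile/nutch/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/epile/nutch/dfs/data</value>
  </property>
</configuration>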
By doing this, I'm able to successfully start the namenode and
jobtracker on the master, and the tasktrackers and datanodes on the
slaves. No error messages; everything looks good. I can query/run
hadoop and check that my data is visible, etc.
However, when I start a "nutch" operation (fetch, inject, etc.), it
never succeeds.
For example, after issuing "nutch inject crawldb urls", I get this error:
060306 184123 Running job: job_hyhtho
060306 184124 map 0% reduce 0%
060306 184137 map 1% reduce 0%
060306 184142 map 2% reduce 0%
060306 184150 map 1% reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
at org.apache.nutch.crawl.Injector.main(Injector.java:138)
In the hadoop jobtracker's log, I can see several tasks being lost, as follows:
060306 184155 Aborting job job_hyhtho
060306 184156 Task 'task_m_7qgat2' has been lost.
060306 184156 Aborting job job_hyhtho
060306 184156 Task 'task_m_lph5qs' has been lost.
060306 184156 Aborting job job_hyhtho
It seems there is some sort of timeout. That's odd, since the machines
are properly configured (nothing has changed) and everything definitely
worked with the previous nutch version (from the end of January).
On the slaves, though, I get a file-not-found error (??):
060306 184316 Server handler 3 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_3blouu/part-0.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_3blouu/part-0.out
    at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
    at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
    at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
    at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)
I'm a bit puzzled since I didn't change anything about my data in dfs.
I also kept the same settings I used before (port numbers, hostnames, etc.).
I guess I must be missing something, or perhaps some properties need to
go into different files?
Originally I kept most of the settings in *nutch-site.xml*, but that
didn't work. I also tried putting some settings in *mapred-default.xml*,
but that didn't work either. I poked around quite a bit, but I couldn't
get it to work properly, or at least back to where it was before I sync'ed.
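One more guess, in case it helps someone spot the problem: since the
missing part-0.out lives under .../mapred/local, I assume the directory
involved is controlled by *mapred.local.dir* (if that's still the right
property). Something like this, with an illustrative path:

  <!-- assumption: mapred.local.dir is what points at .../mapred/local -->
  <property>
    <name>mapred.local.dir</name>
    <value>/var/epile/nutch/mapred/local</value>
  </property>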
Any help would be greatly appreciated.
Thanks,
--Flo