Hi all,
I've been successfully using nutch trunk from the end of January. Since
I've been encountering errors (in dedup) when indexing huge segments, I
decided to sync to the head (as of today), hoping the current version
would solve my problems. Unfortunately, since the update I haven't been
able to run even a single crawl successfully.
Regarding my setup, I'm using 4 machines: 1 namenode/jobtracker and 3
datanodes/tasktrackers.
Here is what I did to transition to the latest nutch (as of today):
- Modified *conf/hadoop-env.sh* to reflect my environment
- Added my slaves in *conf/slaves*
- Moved all of the *ndfs*, *fs*, and *mapred* properties in
*conf/nutch-site.xml* to *conf/hadoop-site.xml* and renamed all *ndfs*
references to *dfs* (a sketch of the resulting file follows this list)
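For reference, here is roughly what my *conf/hadoop-site.xml* looks
like after the move. The hostnames, ports, and paths below are
illustrative placeholders, not necessarily the right values for
anyone else's setup:

<?xml version="1.0"?>
<configuration>
  <!-- placeholders: adjust hostnames, ports, and paths to your setup -->
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <!-- these were ndfs.name.dir / ndfs.data.dir in conf/nutch-site.xml -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/epile/nutch/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/epile/nutch/dfs/data</value>
  </property>
</configuration>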
By doing this, I'm able to successfully start the namenode and
jobtracker on the master, and the tasktrackers and datanodes on the
slaves. No error messages; everything looks good. I can query/run
hadoop and check that my data is visible, etc.
However, when I start a "nutch" operation (fetch, inject, etc.), it
never succeeds.
For example, after issuing "nutch inject crawldb urls", I get this error:
060306 184123 Running job: job_hyhtho
060306 184124 map 0% reduce 0%
060306 184137 map 1% reduce 0%
060306 184142 map 2% reduce 0%
060306 184150 map 1% reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
at org.apache.nutch.crawl.Injector.main(Injector.java:138)
In the hadoop jobtracker's log, I can see several tasks being lost, as follows:
060306 184155 Aborting job job_hyhtho
060306 184156 Task 'task_m_7qgat2' has been lost.
060306 184156 Aborting job job_hyhtho
060306 184156 Task 'task_m_lph5qs' has been lost.
060306 184156 Aborting job job_hyhtho
It seems there is some sort of timeout. That's odd, since the machines
are properly configured (nothing has changed) and everything definitely
worked with the previous nutch version (from the end of January).
On the slaves, though, I get a file-not-found error (??):
060306 184316 Server handler 3 on 50040 caught: java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_3blouu/part-0.out
java.io.FileNotFoundException: /var/epile/nutch/mapred/local/task_m_3blouu/part-0.out
    at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
    at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
    at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
    at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)
I'm a bit puzzled since I didn't change anything about my data in dfs.
I also kept the same settings I used before (port numbers, hostnames, etc.).
I guess I must be missing something, or perhaps some properties need to
go into different files?
Originally I kept most of the settings in *nutch-site.xml*, but that
didn't work. I also tried putting some settings in *mapred-default.xml*,
but that didn't work either. I poked around quite a bit, but I couldn't
get it to work properly, or at least back to where it was before I sync'ed.
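One more guess, in case it helps someone spot the problem: since the
missing part-0.out lives under .../mapred/local, I assume the directory
involved is controlled by *mapred.local.dir* (if that's still the right
property). Something like this, with an illustrative path:

  <!-- assumption: mapred.local.dir is what points at .../mapred/local -->
  <property>
    <name>mapred.local.dir</name>
    <value>/var/epile/nutch/mapred/local</value>
  </property>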
Any help would be greatly appreciated.
Thanks,
--Flo