Chris Schneider wrote:
I was unable to even get the indexing phase started; I would get a timeout right at the 
beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but 
that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. 
As it turned out, it apparently took 14 minutes for it to "compute the splits". 
The job is still running (34% complete). Thus, it does seem like Doug was right about 
this being the problem.

I have no idea why this takes so long. We should profile this operation to figure out what's going on, because it shouldn't take anywhere near that long. It should be easy to write a simple program that constructs a JobConf and InputFormat like those used in this job, and calls getSplits(). Then profile this as a standalone program to see where the time is going. You probably don't want to profile something that takes 14 minutes, so perhaps profile it on a subset of the input.
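
As a rough illustration of the standalone approach, here is a minimal timing harness. It does not touch Hadoop at all: fakeGetSplits() is a hypothetical stand-in that cuts a few made-up file lengths into fixed-size chunks, and SplitTimer and timeMillis() are names invented for this sketch. In a real run, the lambda in main() would instead call getSplits() on the job's InputFormat with a JobConf set up like the real job; that call is left out because its exact signature depends on the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class SplitTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static <T> long timeMillis(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /**
     * Hypothetical stand-in for getSplits(): cuts each "file" into chunks of at
     * most splitSize bytes. Each result is {fileIndex, offset, length}.
     */
    public static List<long[]> fakeGetSplits(long[] fileLengths, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < fileLengths.length; i++) {
            for (long off = 0; off < fileLengths[i]; off += splitSize) {
                long len = Math.min(splitSize, fileLengths[i] - off);
                splits.add(new long[] {i, off, len});
            }
        }
        return splits;
    }

    public static void main(String[] args) throws Exception {
        // A small, made-up subset of the input; swap in the real call here.
        long[] subset = {1000L, 2500L, 64L};
        long ms = timeMillis(() -> fakeGetSplits(subset, 1000L));
        System.out.println("computed splits in " + ms + " ms");
    }
}
```

Wall-clock timing like this only tells you how long the call takes as the input subset grows. To see where the time goes inside the call, run the same program under a profiler.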

Doug


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
