Doug Cutting wrote:
Chris Schneider wrote:
I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

I have no idea why this takes so long. We should profile this operation to figure out what's going on, because it shouldn't take anywhere near that long. It should be easy to write a simple program that constructs a JobConf and InputFormat like those used in this job, and calls getSplits(). Then profile this as a standalone program to see where the time is going. You probably don't want to profile something that takes 14 minutes, so perhaps profile it on a subset of the input.
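
The standalone test described above might be shaped roughly like the following. This is a minimal sketch in Python rather than Java, meant only to show the structure. `compute_splits` is a hypothetical stand-in for the real getSplits() call (it is not the Hadoop API), and the (path, length) input list, `split_size`, and `limit` are assumptions made up for illustration. The profiler is Python's built-in cProfile.

```python
import cProfile
import io
import pstats


def compute_splits(paths, split_size=64):
    """Hypothetical stand-in for getSplits(); NOT the Hadoop API.

    `paths` is a list of (path, length) pairs. Each input is cut into
    (path, offset, size) pieces of at most `split_size` units.
    """
    splits = []
    for path, length in paths:
        offset = 0
        while offset < length:
            size = min(split_size, length - offset)
            splits.append((path, offset, size))
            offset += size
    return splits


def profile_on_subset(paths, limit=10):
    """Profile compute_splits on the first `limit` inputs only.

    Returns the splits and the profiler report as text.
    """
    subset = paths[:limit]
    profiler = cProfile.Profile()
    splits = profiler.runcall(compute_splits, subset)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(10)
    return splits, report.getvalue()


if __name__ == "__main__":
    # Made-up inputs: 100 segments of 200 units each.
    inputs = [("segment-%03d" % i, 200) for i in range(100)]
    splits, report = profile_on_subset(inputs, limit=10)
    print(len(splits), "splits computed")
    print(report)
```

Running on a small `limit` first keeps each profiling run short. Repeating with a larger `limit` would show whether the time grows linearly with the input or faster.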

I have an idea: do you remember the old issue of a MapFile's "index" being corrupted if the Fetcher was interrupted? Random accesses to MapFiles would take ages in that case. Does calculating splits involve random access to the segment's MapFiles?
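
To illustrate why a damaged index could make random access so slow, here is a toy model in Python. It is not Hadoop code. It assumes a sorted list of keys, a sparse index that records every 128th key with its position, and that "corrupted" means the index was cut short. The names `build_index` and `lookup` are made up for this sketch.

```python
import bisect


def build_index(keys, interval=128):
    """Record every `interval`-th key with its position (a sparse index)."""
    return [(keys[i], i) for i in range(0, len(keys), interval)]


def lookup(keys, index, target):
    """Find `target` in the sorted `keys` using the sparse `index`.

    Returns (position, records_scanned); position is None if absent.
    """
    index_keys = [key for key, _ in index]
    # Last index entry whose key is <= target; scan forward from there.
    slot = bisect.bisect_right(index_keys, target) - 1
    start = index[slot][1] if slot >= 0 else 0
    scanned = 0
    for pos in range(start, len(keys)):
        scanned += 1
        if keys[pos] == target:
            return pos, scanned
        if keys[pos] > target:
            break
    return None, scanned


if __name__ == "__main__":
    keys = list(range(0, 20000, 2))   # 10000 sorted keys
    full = build_index(keys)
    truncated = full[:8]              # assumed model of a damaged index
    target = keys[-1]
    print("full index:", lookup(keys, full, target))
    print("truncated index:", lookup(keys, truncated, target))
```

In this model, a lookup with the full index scans at most one interval of records, while a lookup past the end of the truncated index has to scan everything after the last surviving entry.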

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
