Doug Cutting wrote:
Chris Schneider wrote:
I was unable to even get the indexing phase started; I would get a timeout right at the beginning. I tried increasing the ipc.client.timeout from 5 minutes to 10 minutes, but that didn't help. In desperation, I increased it to 30 minutes and went to walk the dogs. As it turned out, it apparently took 14 minutes for it to "compute the splits". The job is still running (34% complete). Thus, it does seem like Doug was right about this being the problem.

I have no idea why this takes so long. We should profile this operation to figure out what's going on, because it shouldn't take anywhere near that long. It should be easy to write a simple program that constructs a JobConf and InputFormat like those used in this job, and calls getSplits(). Then profile this as a standalone program to see where the time is going. You probably don't want to profile something that takes 14 minutes, so perhaps profile it on a subset of the input.
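A rough sketch of the kind of standalone timing harness described above. It does not use the Hadoop classes: SplitSource, dummySplits and timeMillis are hypothetical names made up for illustration, and the dummy source would have to be replaced by code that builds the real JobConf and InputFormat and calls getSplits() on a subset of the input.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitTimer {
    /** Hypothetical stand-in for the job's InputFormat; NOT the Hadoop interface. */
    interface SplitSource {
        List<String> getSplits(int numSplits);
    }

    /** Dummy split computation so the harness runs on its own (assumption, not real job logic). */
    static List<String> dummySplits(int numSplits) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add("split-" + i);
        }
        return splits;
    }

    /** Runs the call once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Replace this with the real InputFormat, pointed at a subset of the segments.
        SplitSource source = SplitTimer::dummySplits;
        long ms = timeMillis(() -> source.getSplits(1000));
        System.out.println("getSplits took " + ms + " ms");
    }
}
```

Once the real call is wired in, the same main() can be run under a profiler to see where the time goes, without having to wait for the full job to start.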

I have an idea: remember the old issue of a MapFile's "index" being corrupted if the Fetcher was interrupted? Random accesses to MapFiles would take ages in that case. Does calculating splits involve random access to the segment's MapFiles?
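To illustrate why a damaged index would hurt so much, here is a simplified model, not MapFile code. It assumes a sorted data file with a sparse index that records every 128th key, and a lookup that jumps to the last usable index entry and then scans forward. The class and method names are made up, the index is searched linearly for brevity, and whether a corrupted index in a real segment behaves exactly like the truncated one here is an assumption.

```java
public class SparseIndexDemo {
    /** Builds n sorted keys 0..n-1, standing in for the records of a data file. */
    static int[] sortedKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
        }
        return keys;
    }

    /**
     * Counts the records read to reach target. Index entry i points at
     * data[i * interval]; only the first indexedEntries entries are usable.
     */
    static int recordsScanned(int[] data, int interval, int indexedEntries, int target) {
        // Find the last usable index entry whose key is <= target.
        int start = 0;
        for (int i = 0; i < indexedEntries; i++) {
            int pos = i * interval;
            if (pos >= data.length || data[pos] > target) {
                break;
            }
            start = pos;
        }
        // Scan forward from that position until the target is reached.
        int scanned = 0;
        for (int p = start; p < data.length; p++) {
            scanned++;
            if (data[p] >= target) {
                break;
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        int[] data = sortedKeys(10000);
        // Complete index: 79 entries cover positions 0, 128, ..., 9984.
        System.out.println("full index:      " + recordsScanned(data, 128, 79, 9999));
        // Truncated index: only the first entry survived.
        System.out.println("truncated index: " + recordsScanned(data, 128, 1, 9999));
    }
}
```

With the complete index the lookup reads 16 records; with only the first index entry left it reads all 10000. If computing the splits did many such lookups, that alone could account for minutes of delay.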

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

