Andrzej Bialecki wrote:
I have an idea: do you remember the old issue of a MapFile's "index" being corrupted if Fetcher was interrupted? Random accesses to MapFiles would take ages in that case. Does calculating splits involve random access to the segment's MapFiles?
No, calculating splits just lists directories and then gets the size of each file. So this could point to an NDFS name node performance problem, or an RPC performance problem. Each file size request is an RPC, and there could be hundreds or even thousands of input files, but even thousands of RPCs shouldn't take 14 minutes.
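
As a rough illustration of the access pattern described above (list the directories, then ask for the size of each file), here is a minimal sketch against the local filesystem using java.nio. It is not the actual Nutch or NDFS code: the class name SplitSizeSketch and its methods are hypothetical, and on NDFS each size lookup would be an RPC to the name node rather than a local call. Timing the loop this way might help show whether the per-file lookups are really where the time goes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: mimics "list directories, then get each file's size".
public class SplitSizeSketch {

    /** A file path paired with its length in bytes. */
    static final class SizedFile {
        final Path path;
        final long bytes;

        SizedFile(Path path, long bytes) {
            this.path = path;
            this.bytes = bytes;
        }
    }

    /** Walks dir recursively and performs one size lookup per regular file. */
    static List<SizedFile> listSizes(Path dir) throws IOException {
        List<SizedFile> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    result.addAll(listSizes(entry));
                } else if (Files.isRegularFile(entry, LinkOption.NOFOLLOW_LINKS)) {
                    // On NDFS this lookup would be a name node RPC (assumption:
                    // here it is only a local filesystem call).
                    result.add(new SizedFile(entry, Files.size(entry)));
                }
            }
        }
        return result;
    }

    /** Sums the sizes of all listed files. */
    static long totalBytes(List<SizedFile> files) {
        long total = 0;
        for (SizedFile f : files) {
            total += f.bytes;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        long start = System.nanoTime();
        List<SizedFile> files = listSizes(dir);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d files, %d bytes, %d ms%n",
                files.size(), totalBytes(files), elapsedMs);
    }
}
```

For scale: 14 minutes is 840 seconds, so if there were, say, 2,000 input files (a made-up count), that would come to about 420 ms per size request, which is far slower than a single RPC should be.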
Doug
