Small file Map performance

Aaron Baff Wed, 02 Mar 2011 09:34:25 -0800

So, the problem is we have a crap ton of small files and a limited-size 
cluster (only 4 nodes, just up from 2, yay!) as we are just starting to use 
Hadoop. With our current hardware, we have 32 Map slots, and >1500 files. The 
Task startup time is, frankly, killing us, and at this time we can't easily 
concatenate them all into a single file as they are still coming in, and we want to 
run some analysis on them while they are still inbound. Several months ago we 
played around with JVM re-use, but if I recall correctly a Task stays keyed 
to an individual MR Job until it hits its TTL, and then that slot becomes 
available for another Job. Is there a way to adjust this TTL, or to re-use 
the JVM for a different Job? This is all with 0.21.0.



--Aaron
