Hi Everyone,
I am working on a project that ingests data from a large number of text
files, and although there are many ways to do this, it is not clear to me
which is the best/fastest. I am running on an EC2 cluster of approximately
20 machines.

The data is currently spread across 20k text files (about 3 GB total), each
of which needs to be treated as a whole (no splits within those files), but
I am willing to change the format around if I can get increased speed. Using
the regular TextInputFormat, adjusted to read in entire files, is pretty
slow, since each file takes a minimum of about 3 seconds no matter how small
it is.

From what I have read, the possible options to proceed with are as follows:
1. Use MultiFileInputSplit. It seems to be designed for this sort of
situation, but I have yet to see an implementation of it, or any commentary
on its performance increase over the regular input.
2. Read the data in once, write it out as a SequenceFile, and use the
SequenceFile as input from there on out.
3. Condense the files down into a small number of larger files (say ~100)
and delimit them so that each part gets a separate record reader.
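To make option 3 concrete, the kind of pre-processing step I have in mind is something like the sketch below (just plain Java, run once before the job; the `===RECORD===` delimiter and the `CondenseFiles` name are arbitrary placeholders, and a custom RecordReader would then split records on that delimiter):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class CondenseFiles {
    // Placeholder delimiter; a custom RecordReader would split on this later.
    static final String DELIM = "\n===RECORD===\n";

    // Concatenate every file in inputDir into a single output file,
    // separating the whole-file contents with DELIM so each original
    // file becomes one record.
    static void condense(Path inputDir, Path outFile) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir);
             BufferedWriter out =
                 Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            boolean first = true;
            for (Path f : files) {
                if (!first) out.write(DELIM);
                out.write(Files.readString(f, StandardCharsets.UTF_8));
                first = false;
            }
        }
    }
}
```

Repeating this over ~100 buckets of files would give ~100 large inputs instead of 20k small ones.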

If anyone could give me guidance as to what will provide the best
performance for this setup, I would greatly appreciate it.

Thanks for your help



-- 
View this message in context: 
http://www.nabble.com/Handling-Large-Number-Of-Files%2C-Fastest-Way-tp17297485p17297485.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
