Hi,

You can realize a huge improvement by packing them into a SequenceFile. With lots of small files, name lookups against the NameNode become a big bottleneck.

One easy approach is to make the key a Text containing the filename and the value a BytesWritable holding the file's contents. Since the files are relatively small (or you wouldn't be having this problem), you won't have to worry about running out of memory. This worked really well for me with a few hundred thousand ~4MB files.
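A minimal sketch of that packing step, using the classic SequenceFile.createWriter API (the input and output paths here are hypothetical, and this assumes the small files already sit on HDFS):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths -- substitute your own.
    Path inputDir = new Path("/data/small-files");
    Path outFile = new Path("/data/packed.seq");

    // Key = filename, value = raw file bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        byte[] buf = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(0, buf); // small files, so one read is fine
        } finally {
          in.close();
        }
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}
```

Downstream jobs then use SequenceFileInputFormat, and each map call sees one whole original file as a single record, which also gives you the "no splits within a file" behavior for free.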

Brian

jkupferman wrote:
Hi Everyone,
I am working on a project which takes in data from a lot of text files, and
although there are a lot of ways to do it, it is not clear to me which is
the best/fastest. I am working on an EC2 cluster with approximately 20
machines.

The data is currently spread across 20k text files (~3GB total), each of which needs to be treated as a whole (no splits within a file), but I am willing to change the format if I can get increased speed. Using the regular TextInputFormat, adjusted to take in entire files, is pretty slow, since each file takes a minimum of about 3 seconds no matter how small it is.
From what I have read, the possible options are as follows:
1. Use MultiFileInputSplit, it seems to be designed for this sort of
situation, but I have yet to see an implementation of this, or a commentary
on its performance increase over the regular input.
2. Read the data in, and output it as a Sequence File and use the sequence
file as input from there on out.
3. Condense the files down to a small number of files (say ~100) and then
delimit the files so each part gets a separate record reader.
If anyone could give me guidance as to what will provide the best
performance for this setup, I would greatly appreciate it.

Thanks for your help



