Hi,

You can see a huge improvement by packing them into a SequenceFile. With lots of small files, name lookups against the NameNode become a big bottleneck.
One easy approach is to make the key a Text containing the filename that was loaded in, and the value a BytesWritable holding the contents of the file. Since the files are relatively small (or you wouldn't be having this problem), you won't have to worry about running out of memory. This worked really well for me when dealing with a few hundred thousand ~4MB files.
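A minimal sketch of that packing step, using the old-style SequenceFile.createWriter API (assumes Hadoop is on the classpath; the output path and class name are just illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // e.g. a SequenceFile on HDFS

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) {
                File f = new File(args[i]);
                // Files are small, so reading each one fully into
                // memory is safe here.
                byte[] buf = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    int off = 0;
                    while (off < buf.length) {
                        off += in.read(buf, off, buf.length - off);
                    }
                } finally {
                    in.close();
                }
                // Key: original filename; value: raw file contents.
                writer.append(new Text(f.getName()),
                              new BytesWritable(buf));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

From there you can point your job at the packed file with SequenceFileInputFormat, so each map record is one whole original file.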
Brian jkupferman wrote:
Hi everyone,

I am working on a project that takes in data from a lot of text files, and although there are many ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster of approximately 20 machines. The data is currently spread across 20k text files (~3GB total), each of which needs to be treated as a whole (no splits within those files), but I am willing to change the format around if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow, since each file takes a minimum of about ~3 seconds no matter how small it is.

From what I have read, the possible options to proceed with are as follows:

1. Use MultiFileInputSplit; it seems to be designed for this sort of situation, but I have yet to see an implementation of it, or a commentary on its performance increase over the regular input.
2. Read the data in, output it as a SequenceFile, and use the sequence file as input from there on out.
3. Condense the files down to a small number of files (say ~100) and then delimit the files so each part gets a separate record reader.

If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it. Thanks for your help.
