Hello,

I have a similar scenario to jkupferman's: thousands of files, mostly ranging from KBs to a few MBs, with a few in the GBs. I am not too familiar with Java and am using Hadoop Streaming with Python. The mapper must work on individual files.

I've placed the thousands of files into the DFS. I give the map job a manifest listing the locations of the files; Hadoop streams the manifest to my Python script, which copies each named file locally and processes it.
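For concreteness, a minimal sketch of such a streaming mapper, assuming each manifest line is an HDFS path; the fetch via "hadoop fs -get" and the process_file() helper are just illustrative stand-ins for the real per-file work:

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: reads one HDFS path per line from the
# manifest on stdin, copies the file into the task's working directory, and
# hands it to a placeholder process_file() function.
import os
import subprocess
import sys

def process_file(local_path):
    # Placeholder: emit one tab-separated key/value pair per file on stdout.
    print "%s\t%d" % (local_path, os.path.getsize(local_path))

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    local_name = os.path.basename(hdfs_path)
    # Copy the file out of HDFS before processing it locally.
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
    process_file(local_name)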
I also tried tarring the files, converting the tar into a sequence file, and then using SequenceFileAsTextInputFormat. The problem with this is that it sends the file contents as a string representation of the bytes, which I then have to convert back.
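Assuming that string form is BytesWritable's hex dump (space-separated two-digit hex per byte, which is my reading of it), the conversion back to raw bytes on the Python side looks something like this:

# Assumption: the value arrives as space-separated two-digit hex,
# e.g. "48 65 6c 6c 6f" for "Hello".
def hex_string_to_bytes(value):
    return "".join(chr(int(tok, 16)) for tok in value.split())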
Q: Is there any way to have the data sent to me as a BytesWritable (mentioned below), using the command line and Python?
Thanks for your time.
Regards
Saptarshi
On May 18, 2008, at 10:54 PM, Brian Vargas wrote:
Hi,

You can realize a huge improvement by sticking them into a sequence file. With lots of small files, name lookups against the name node will be a big bottleneck.

One easy approach is making the key be a Text of the filename that was loaded in, and the value be a BytesWritable, which is the contents of the file. Since they're relatively small files (or you wouldn't be having this problem), you won't have to worry about OOMing yourself. It worked really well for me, dealing with a few hundred thousand ~4MB files.

Brian

jkupferman wrote:

Hi Everyone,

I am working on a project which takes in data from a lot of text files, and although there are a lot of ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster with approximately 20 machines. The data is currently spread across 20k text files (a total of ~3 GB), each of which needs to be treated as a whole (no splits within those files), but I am willing to change around the format if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow, since each file takes a minimum of about ~3 seconds no matter how small it is.

From what I have read, the possible options to proceed with are as follows:

1. Use MultiFileInputSplit; it seems to be designed for this sort of situation, but I have yet to see an implementation of this, or a commentary on its performance increase over the regular input.

2. Read the data in, output it as a sequence file, and use the sequence file as input from there on out.

3. Condense the files down to a small number of files (say ~100) and then delimit the files so each part gets a separate record reader.

If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it. Thanks for your help
Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha

The typewriting machine, when played with expression, is no more annoying than the piano when played by a sister or near relation.
-- Oscar Wilde
