Hello,

I have a similar scenario to jkupferman's: thousands of files, mostly ranging from KBs to a few MBs, with a few in the GBs. I am not too familiar with Java and am using Hadoop Streaming with Python. The mapper must work on individual files.

I've placed the thousands of files into the DFS. I give the map job a manifest listing the locations of the files; Hadoop streams the manifest to my Python script, which copies each named file locally and processes it.
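For concreteness, a minimal sketch of such a streaming mapper, assuming each manifest line is an HDFS path; the fetch via "hadoop fs -get" and the process_file() helper are just illustrative stand-ins for the real per-file work:

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: reads one HDFS path per line from the
# manifest on stdin, copies the file into the task's working directory, and
# hands it to a placeholder process_file() function.
import os
import subprocess
import sys

def process_file(local_path):
    # Placeholder: emit one tab-separated key/value pair per file on stdout.
    print "%s\t%d" % (local_path, os.path.getsize(local_path))

for line in sys.stdin:
    hdfs_path = line.strip()
    if not hdfs_path:
        continue
    local_name = os.path.basename(hdfs_path)
    # Copy the file out of HDFS before processing it locally.
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
    process_file(local_name)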
I also tried tarring the files, converting the tar into a sequence file, and then using SequenceFileAsTextInputFormat. The problem with this is that it sends the file contents as a string representation of the bytes, which I then have to convert back.
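Assuming that string form is BytesWritable's hex dump (space-separated two-digit hex per byte, which is my reading of it), the conversion back to raw bytes on the Python side looks something like this:

# Assumption: the value arrives as space-separated two-digit hex,
# e.g. "48 65 6c 6c 6f" for "Hello".
def hex_string_to_bytes(value):
    return "".join(chr(int(tok, 16)) for tok in value.split())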
Q: Is there any way to have the data sent to me as a BytesWritable (mentioned below), using the command line and Python?
Thanks for your time.
Regards
Saptarshi
On May 18, 2008, at 10:54 PM, Brian Vargas wrote:
Hi,

You can realize a huge improvement by sticking them into a sequence file. With lots of small files, name lookups against the name node will be a big bottleneck.

One easy approach is making the key be a Text of the filename that was loaded in, and the value be a BytesWritable, which is the contents of the file. Since they're relatively small files (or you wouldn't be having this problem), you won't have to worry about OOMing yourself. It worked really well for me, dealing with a few hundred thousand ~4MB files.

Brian

jkupferman wrote:

Hi Everyone,

I am working on a project which takes in data from a lot of text files, and although there are a lot of ways to do it, it is not clear to me which is the best/fastest. I am working on an EC2 cluster with approximately 20 machines. The data is currently spread across 20k text files (a total of ~3 GB), each of which needs to be treated as a whole (no splits within those files), but I am willing to change around the format if I can get increased speed. Using the regular TextInputFormat adjusted to take in entire files is pretty slow, since each file takes a minimum of about ~3 seconds no matter how small it is.

From what I have read, the possible options to proceed with are as follows:

1. Use MultiFileInputSplit; it seems to be designed for this sort of situation, but I have yet to see an implementation of this, or a commentary on its performance increase over the regular input.

2. Read the data in, output it as a sequence file, and use the sequence file as input from there on out.

3. Condense the files down to a small number of files (say ~100) and then delimit the files so each part gets a separate record reader.

If anyone could give me guidance as to what will provide the best performance for this setup, I would greatly appreciate it. Thanks for your help
Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha

The typewriting machine, when played with expression, is no more annoying than the piano when played by a sister or near relation.
-- Oscar Wilde
