Our team is still new to Hadoop, and a colleague and I are trying to decide on a file format. The two arguments are:

* We should use a SequenceFile (binary) format as it's faster for the machine to read than parsing text, and the files are smaller.

* We should use a text file format as it's easier for humans to read and easier to change, and text files compress well. If the text format is designed well, and given that in a distributed system like Hadoop you can throw more nodes at a problem, the text parsing time will wind up being a negligible part of the overall processing time.
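
To make the size half of the argument concrete, here is a minimal sketch of how we could measure it ourselves. It is plain Python, not Hadoop: the fixed-width binary layout is only a stand-in for a binary record format and is not the actual SequenceFile layout. The record shape (an integer key and a float value), the synthetic data, and all function names are assumptions made up for illustration.

```python
import gzip
import random
import struct


def make_records(n, seed=0):
    """Generate n synthetic (int key, float value) records (hypothetical data)."""
    rng = random.Random(seed)
    return [(rng.randrange(10**9), rng.random() * 1000.0) for _ in range(n)]


def encode_text(records):
    """Text format: one tab-separated record per line."""
    return "".join(f"{k}\t{v!r}\n" for k, v in records).encode("utf-8")


def decode_text(data):
    """Parse the tab-separated text format back into records."""
    out = []
    for line in data.decode("utf-8").splitlines():
        key, value = line.split("\t")
        out.append((int(key), float(value)))
    return out


def encode_binary(records):
    """Binary stand-in: big-endian 8-byte integer plus 8-byte double per record."""
    return b"".join(struct.pack(">qd", k, v) for k, v in records)


def decode_binary(data):
    """Unpack the fixed-width binary records."""
    return list(struct.iter_unpack(">qd", data))


def size_report(n=10000):
    """Return raw and gzip-compressed sizes, in bytes, for both encodings."""
    records = make_records(n)
    text = encode_text(records)
    binary = encode_binary(records)
    return {
        "text": len(text),
        "text_gzip": len(gzip.compress(text)),
        "binary": len(binary),
        "binary_gzip": len(gzip.compress(binary)),
    }


if __name__ == "__main__":
    for name, size in size_report().items():
        print(f"{name:12s} {size:>10d} bytes")
```

The numbers will depend heavily on what the real records look like, so this only shows how to compare, not which format wins. Wrapping decode_text and decode_binary in a timer would give a similar rough check on the parsing-time half of the argument.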

I realize I'm leaving out a lot of variables and specifics that could affect the answer, but I'm wondering whether the Hadoop community has any general rules of thumb about this, like "favor (binary) sequence files over text files" or some such.

If anyone has any general suggestions/advice here, please post back.

Thanks,

DR
