Re: Text files vs. SequenceFiles

Alex Loddengaard Fri, 02 Jul 2010 15:15:34 -0700

Hi David,

On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <[email protected]> wrote:
>
> * We should use a SequenceFile (binary) format as it's faster for the
> machine to read than parsing text, and the files are smaller.
>
> * We should use a text file format as it's easier for humans to read,
> easier to change, text files can be compressed quite small, and a) if the
> text format is designed well and b) given the context of a distributed
> system like Hadoop where you can throw more nodes at a problem, the text
> parsing time will wind up being negligible/irrelevant in the overall
> processing time.
>


SequenceFiles can also be compressed, either per record or per block.  This
matters if you want to use gzip, because a plain gzipped file isn't
splittable.  A block-compressed SequenceFile is still splittable, because
each block is gzipped on its own rather than the whole file being gzipped
as one stream.
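
To make the splittability point concrete, here is a minimal Python sketch of
the idea.  It is not the real SequenceFile on-disk format: the helper names
write_blocks and read_block are made up for illustration, and the
(offset, length) index is only a stand-in for the sync markers Hadoop uses to
find block boundaries.  It also assumes records contain no newlines.

```python
import gzip
import io


def write_blocks(records, records_per_block=2):
    """Gzip each group of records independently; return (data, index).

    index holds one (offset, length) pair per compressed block. This is a
    hypothetical stand-in for SequenceFile sync markers, not the real format.
    """
    out = io.BytesIO()
    index = []
    for i in range(0, len(records), records_per_block):
        raw = "\n".join(records[i:i + records_per_block]).encode("utf-8")
        block = gzip.compress(raw)
        index.append((out.tell(), len(block)))
        out.write(block)
    return out.getvalue(), index


def read_block(data, index, n):
    """Decompress only block n, without reading the blocks before it."""
    offset, length = index[n]
    raw = gzip.decompress(data[offset:offset + length])
    return raw.decode("utf-8").split("\n")


records = ["r0", "r1", "r2", "r3", "r4"]
data, index = write_blocks(records)

# A task assigned the middle block can start there directly.
print(read_block(data, index, 1))  # ['r2', 'r3']
```

If the same records were gzipped as a single stream, a reader that wanted
only the middle records would have to decompress from the first byte, which is
why a plain .gz file ends up handled by a single task.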

As for readability, "hadoop fs -text" does for SequenceFiles what
"hadoop fs -cat" does for text files: it decodes the records and prints them
as text.

Lastly, I promise that eventually you'll run out of space in your cluster
and wish you had compressed your data better.  Compression also tends to
make jobs faster, since there is less data to read from disk and move over
the network.

The general recommendation is to use SequenceFiles as early in your ETL as
possible.  Usually people get their data in as text, and after the first MR
pass they work with SequenceFiles from there on out.

Alex
