Vijay Murthi wrote:
I am trying to process several gigs of zipped text files from a directory. If I unzip them, the size increases at least fourfold and I could run out of disk space. Has anyone tried reading zipped text files directly from the input directory? Or has anyone tried implementing a zip version of SequenceFileRecordReader.java and Filesplit?

SequenceFile currently supports per-record compression. This is effective when your input records are fairly large (> a few kB).
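
To give a rough feel for why record size matters, here is a small standalone sketch. It is only an illustration under stated assumptions: it uses gzip from java.util.zip as a stand-in for whatever codec SequenceFile applies, and the class name, method names, and sample records are all made up for the example. It compresses a batch of short records one at a time, then compresses the same bytes as a single stream, and compares the totals.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Illustration only: gzip stands in for the real codec (an assumption).
public class RecordCompressionDemo {

    // Gzip one byte array and return the compressed size in bytes.
    public static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        return buf.size();
    }

    // Hypothetical sample data: many short, similar text records.
    public static List<byte[]> sampleRecords(int count) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String line = "user=" + i + " action=click page=/index.html\n";
            records.add(line.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    // Total size when every record is compressed on its own.
    public static int perRecordSize(List<byte[]> records) throws IOException {
        int total = 0;
        for (byte[] record : records) {
            total += gzipSize(record);
        }
        return total;
    }

    // Size when all records are concatenated and compressed as one stream.
    public static int wholeStreamSize(List<byte[]> records) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] record : records) {
            all.write(record, 0, record.length);
        }
        return gzipSize(all.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> records = sampleRecords(1000);
        System.out.println("per-record total: " + perRecordSize(records) + " bytes");
        System.out.println("single stream:    " + wholeStreamSize(records) + " bytes");
    }
}
```

With records this short, each one pays the fixed per-stream overhead and the compressor never sees the repetition across records, so the per-record total should come out much larger than the single stream.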

What format are your zipped input files in? Are there multiple records per file? If so, how big are the records?

A future goal for SequenceFile is to support compression across multiple records, which would make compression effective with small records. Until then, compressing small records is difficult. The best approach currently is to use an InputFormat that does not split files, but instead makes each file a distinct split. Then try to divide your data into approximately equal-sized files, each compressed separately.
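
As a rough sketch of what reading one whole compressed file as a single unit looks like, the example below streams lines straight out of a gzipped file without ever writing the uncompressed text to disk. It assumes the inputs are gzip (.gz) text files, which may not match your data, and it uses only java.util.zip rather than any Hadoop class. The class and method names are hypothetical. A gzip stream has to be decompressed from its beginning, which is why each file is best treated as one split.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: treats one gzipped text file as one unit of work.
public class GzipFileAsSplit {

    // Write lines to a gzipped text file (used here only to create sample input).
    public static void writeGzip(Path file, List<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    // Stream every line of the gzipped file to the handler, decompressing
    // on the fly, and return the number of lines read.
    public static long forEachLine(Path file, Consumer<String> handler) throws IOException {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("part-0", ".gz");
        writeGzip(file, Arrays.asList("first record", "second record", "third record"));
        long n = forEachLine(file, System.out::println);
        System.out.println(n + " records read");
        Files.delete(file);
    }
}
```

Only a small buffer of uncompressed text is held at any moment, so disk usage stays at the compressed size. Keeping the files roughly equal in size should help each task get a similar amount of work.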

Doug
