Thanks Doug. I have around 500 directories. Each directory has around
500 files, each a 25 MB gzip (around 140 MB uncompressed). An
uncompressed file has around 170,000 lines, and each line is about
0.85 KB on average.

I have just started looking at the Hadoop source code. How can I make
each file a distinct split? My data is already evenly distributed
across these compressed files.
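
Based on my (very early) reading of the InputFormat code, is something
like the sketch below the right direction? The class name is just my
own placeholder, and I am guessing at the isSplitable() hook from the
source, so the exact method signature may differ in the version I am
running:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Hypothetical input format: never split a file, so each gzipped
  // input file becomes exactly one split (and one map task).
  public class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }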

I see that Hadoop uses abstracted Java classes for file I/O. Which
files should I change so that in MapClass, inside the map function, a
call to value.toString() returns a record?
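
For example, this is roughly what I would like map() to look like
(untested sketch; the Mapper interface signature below is just how I
read it in the current source, so it may not match exactly):

  import java.io.IOException;

  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class MapClass extends MapReduceBase implements Mapper {

    // With a line-oriented RecordReader feeding the map, value should
    // hold exactly one line, i.e. one record of the uncompressed file.
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      String record = value.toString();
      // ... parse 'record' and collect output pairs here ...
    }
  }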


Hope this helps,
VJ



> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 11, 2006 11:09 AM
> To: [email protected]
> Subject: Re: reading zip files
> 
> Vijay Murthi wrote:
> > I am trying to process several gigs of zipped text files from a
> > directory. If I unzip them the size increases at least 4 times and
> > potentially I can run out of disk space.
> >
> > Has anyone tried to read zipped text files directly from the input
> > directory?
> >
> > or has anyone tried implementing a zip version of
> > SequenceFileRecordReader.java and FileSplit?
> 
> SequenceFile currently supports per-record compression.  This is
> effective when your input records are fairly large (> a few kB).
> 
> What format are your zipped input files in?  Are there multiple records
> per file?  If so, how big are the records?  A future goal for
> SequenceFile is to support compression across multiple records, to make
> compression effective with small records.  Until then, compression of
> small records is difficult.  The best approach currently is to use an
> InputFormat that does not split files, but makes each file into a
> distinct split.  Then try to divide your data into approximately equal
> sized files that are each compressed.
> 
> Doug

