Thanks Doug. I have around 500 directories. Each directory has around 500 files, each about 25 MB gzipped (around 140 MB uncompressed). An uncompressed file has around 170,000 lines, and each line is about 0.85 KB on average.
I have just started looking at the Hadoop source code. How can I make each file a distinct split? My data is already evenly distributed across these compressed files. I see Hadoop uses abstracted Java classes for file I/O. Which files are appropriate to change so that in MapClass, inside the map function, value.toString() returns a record?

Hope this helps,
VJ

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 11, 2006 11:09 AM
> To: [email protected]
> Subject: Re: reading zip files
>
> Vijay Murthi wrote:
> > I am trying to process several gigs of zipped text files from a
> > directory. If I unzip them, the size increases at least 4 times and
> > I can potentially run out of disk space.
> >
> > Has anyone tried to read zipped text files directly from the input
> > directory?
> >
> > Or has anyone tried implementing a zip version of
> > SequenceFileRecordReader.java and FileSplit?
>
> SequenceFile currently supports per-record compression. This is
> effective when your input records are fairly large (> a few kB).
>
> What format are your zipped input files in? Are there multiple records
> per file? If so, how big are the records? A future goal for
> SequenceFile is to support compression across multiple records, to make
> compression effective with small records. Until then, compression of
> small records is difficult. The best approach currently is to use an
> InputFormat that does not split files, but makes each file into a
> distinct split. Then try to divide your data into approximately equal
> sized files that are each compressed.
>
> Doug
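
[For reference, a minimal sketch of the approach Doug describes: an InputFormat that refuses to split files, so each gzipped file becomes exactly one split handled by one map task. This is written against the newer org.apache.hadoop.mapreduce API rather than the API current at the time of this thread, and the class name WholeGzipFileTextInputFormat is only illustrative, not part of Hadoop.]

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Treats every input file as a single, unsplittable unit, so each
    // gzipped file becomes exactly one split read whole by one map task.
    // The default line-oriented record reader still hands the map()
    // function one line per call, so value.toString() returns one record.
    public class WholeGzipFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

[The job would then select this class with job.setInputFormatClass(WholeGzipFileTextInputFormat.class); the evenly sized compressed files mentioned above keep the resulting map tasks roughly balanced.]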
