Thanks Doug. That worked spectacularly! I was also able to speed it up a little more by setting a bigger buffer size around the InputStreamReader.
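(For reference, the buffer-size knobs in Doug's snippet are the BufferedReader and GZIPInputStream constructors, since InputStreamReader itself doesn't take one. A sketch with an illustrative 64K size, not a measured optimum:)

    // Illustrative 64K buffers; the defaults are 8K chars for
    // BufferedReader and 512 bytes for GZIPInputStream.
    final BufferedReader in =
      new BufferedReader(new InputStreamReader
        (new GZIPInputStream(fs.open(split.getPath()), 64 * 1024)),
       64 * 1024);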
-VJ

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 11, 2006 11:51 AM
> To: [email protected]
> Subject: Re: reading zip files
>
> Vijay Murthi wrote:
> > I have just started looking at Hadoop source code. How can I use each
> > file as a distinct split? My data is already evenly distributed across
> > these compressed files.
>
> Implement your own InputFormat.
>
> http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html
>
> In particular, your getSplits implementation should return a single
> split per input file, ignoring the numSplits parameter.
>
> You can probably subclass InputFormatBase, and have your getSplits
> method simply call listPaths() and then construct and return a single
> split per path returned.
>
> Your RecordReader implementation might then look something like:
>
>   public RecordReader getRecordReader(FileSystem fs, FileSplit split,
>                                       JobConf job, Reporter reporter)
>     throws IOException {
>
>     final BufferedReader in =
>       new BufferedReader(new InputStreamReader
>         (new GZIPInputStream(fs.open(split.getPath()))));
>
>     return new RecordReader() {
>         long position;
>
>         public synchronized boolean next(Writable key, Writable value)
>           throws IOException {
>           String line = in.readLine();
>           if (line != null) {
>             position += line.length();
>             ((UTF8)value).set(line);
>             return true;
>           }
>           return false;
>         }
>
>         public synchronized long getPos() throws IOException {
>           return position;
>         }
>
>         public synchronized void close() throws IOException {
>           in.close();
>         }
>
>       };
>   }
>
> Then include your InputFormat's class file in your job's jar file.
>
> Doug
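P.S. For the archives, here is roughly what the whole class looks like when the getSplits half Doug describes is fleshed out. This is an untested sketch; the listPaths(), getLength(), and FileSplit signatures are how I read the current API, so double-check them against your tree:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.UTF8;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputFormatBase;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class GzipLineInputFormat extends InputFormatBase {

      public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
          throws IOException {
        // A gzip stream can't be opened mid-file, so return exactly one
        // split per input file and ignore the numSplits hint.
        Path[] files = listPaths(fs, job);
        FileSplit[] splits = new FileSplit[files.length];
        for (int i = 0; i < files.length; i++) {
          splits[i] = new FileSplit(files[i], 0, fs.getLength(files[i]));
        }
        return splits;
      }

      public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                          JobConf job, Reporter reporter)
          throws IOException {
        // Doug's reader from the message above, unchanged: one text line
        // per record, decompressed on the fly.
        final BufferedReader in =
          new BufferedReader(new InputStreamReader
            (new GZIPInputStream(fs.open(split.getPath()))));

        return new RecordReader() {
            long position;

            public synchronized boolean next(Writable key, Writable value)
              throws IOException {
              String line = in.readLine();
              if (line != null) {
                position += line.length();
                ((UTF8)value).set(line);
                return true;
              }
              return false;
            }

            public synchronized long getPos() throws IOException {
              return position;
            }

            public synchronized void close() throws IOException {
              in.close();
            }
          };
      }
    }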
