Vijay Murthi wrote:
> I have just started looking at the Hadoop source code. How can I use
> each file as a distinct split? My data is already evenly distributed
> across these compressed files.
Implement your own InputFormat.
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html
In particular, your getSplits implementation should return a single
split per input file, ignoring the numSplits parameter.
You can probably subclass InputFormatBase, and have your getSplits
method simply call listPaths() and then construct and return a single
split per path returned.
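A rough sketch (the class name GzipFileInputFormat is just for
illustration, and you should check the exact signatures of listPaths()
and FileSystem.getLength() against your Hadoop version):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class GzipFileInputFormat extends InputFormatBase {

  public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
    throws IOException {
    Path[] files = listPaths(fs, job);      // all input files
    FileSplit[] splits = new FileSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      // One split spanning the whole file; numSplits is deliberately
      // ignored, since a gzip stream cannot be read from the middle.
      splits[i] = new FileSplit(files[i], 0, fs.getLength(files[i]));
    }
    return splits;
  }

  // plus the getRecordReader method shown below
}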
Your RecordReader implementation might then look something like:
public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                    JobConf job, Reporter reporter)
  throws IOException {

  // Decompress the whole file; since gzip cannot be split, the reader
  // always starts at the beginning.
  final BufferedReader in =
    new BufferedReader(new InputStreamReader(
      new GZIPInputStream(fs.open(split.getPath()))));

  return new RecordReader() {
      long position;                     // uncompressed chars read so far

      public synchronized boolean next(Writable key, Writable value)
        throws IOException {
        String line = in.readLine();
        if (line != null) {
          // Assumes the job's key class is LongWritable and its value
          // class is UTF8, as with the standard line-based input.
          ((LongWritable)key).set(position);
          ((UTF8)value).set(line);
          position += line.length() + 1; // +1 for the stripped newline
          return true;
        }
        return false;                    // end of file
      }

      public synchronized long getPos() throws IOException {
        return position;                 // approximate, used for progress
      }

      public synchronized void close() throws IOException {
        in.close();
      }
    };
}
Then include your InputFormat's class file in your job's jar file.
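And tell the job to use it, e.g. (using the illustrative class name
from above; the input path here is hypothetical):

JobConf job = new JobConf();
job.setInputFormat(GzipFileInputFormat.class);
job.setInputPath(new Path("/my/gzipped/input"));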
Doug