Hi all,
The input of my MapReduce job is two large text files, and each InputSplit
consists of a portion of both files. The splits are content dependent, so I
have to read the input files in order to generate them. The problem is that
most of the job's time is spent generating these splits; the Map and Reduce
phases actually take less time than that. Is there an efficient way to
generate splits from the files? My InputFormat class is based on
FileInputFormat. FileInputFormat's getSplits method doesn't read the input
files, but that approach is not possible for me because my splits depend on
the content of the files.
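
To make the cost concrete, here is a simplified sketch of the kind of scan
involved. It is not my actual code: the class name SplitScan, the method
splitOffsets, and the boundary rule (a new split starts at every line that
begins with a marker string) are made up for illustration, and it uses plain
Java I/O rather than the Hadoop API. The point is only that finding the
boundaries means reading every line sequentially before any mapper can start.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitScan {
    // Hypothetical boundary rule: a split starts at each line beginning with `marker`.
    // Returns the byte offset at which each split starts.
    static List<Long> splitOffsets(BufferedReader in, String marker) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0; // byte offset of the current line
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(marker)) {
                    offsets.add(pos);
                }
                // Assumes UTF-8 text and single-byte '\n' line endings.
                pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }
}
```

A scan like this is a single-threaded pass over the whole input, which is
roughly why split generation dominates the job time in my case.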
Any ideas or comments are appreciated.