You can definitely use the approach you suggest, and you should get good results if you are looking for only a small fraction of the file. Basically, you would have the record reader check whether any interesting records exist in the current split; if so, read them, and if not, just exit.
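For the simple record-by-record version, something like the following might do. It's a rough, untested sketch against the old org.apache.hadoop.mapred API; the "filter.min.value" property and the string comparison are assumptions for illustration, and you would hand this out from your InputFormat's getRecordReader().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class FilteringRecordReader implements RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate;
  private final String minValue;

  public FilteringRecordReader(JobConf job, FileSplit split) throws IOException {
    delegate = new LineRecordReader(job, split);
    minValue = job.get("filter.min.value");  // hypothetical config knob
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // Read and discard records until one passes the predicate or the split
    // is exhausted. Assumes string ordering matches the file's sort order.
    while (delegate.next(key, value)) {
      if (value.toString().compareTo(minValue) > 0) {
        return true;
      }
    }
    return false;  // no interesting records left in this split
  }

  public LongWritable createKey() { return delegate.createKey(); }
  public Text createValue() { return delegate.createValue(); }
  public long getPos() throws IOException { return delegate.getPos(); }
  public float getProgress() throws IOException { return delegate.getProgress(); }
  public void close() throws IOException { delegate.close(); }
}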
You should be careful, however, about doing too many random accesses, since they will kill your disk performance. It would be ideal if the record reader could determine whether there are interesting records without reading any data from the input split, possibly by reference to an external index of file offsets.

Whenever you do find an interesting split, I would recommend that you read the whole split rather than worry about reading only the interesting records. That will help avoid random read patterns.

Another issue is that if only a small number of records are interesting, then you may have very limited parallelism unless you invoke a very large number of readers.

An alternative approach would be to have a side file with file offsets and build an input format that would treat a byte range as if it were a normal input. Essentially, what you would have would be a FileByteRangeSplit in place of a FileSplit. That would avoid all of the problems with limited parallelism, checking offsets, and so on. A rough sketch of that idea follows below the quoted message.

On 3/4/08 9:11 PM, "Andy Pavlo" <[EMAIL PROTECTED]> wrote:

> Let's say I have a simple data file with <key, value> pairs and the entire
> file is in ascending sorted order by 'value'. What I want to be able to do
> is filter the data so that the map function is only invoked with
> <key, value> pairs where 'value' is greater than some input value.
>
> Does such a feature already exist, or would I need to implement my own
> RecordReader to do this filtering? Is this the right place to do this in
> Hadoop's input pipeline?
>
> What I essentially want is a cheap index. By sorting the values ahead of
> time, you could just do a binary search on the InputSplit until you found
> the starting value that satisfies the predicate. The RecordReader would
> then start at this point in the file, read all the lines in, and pass the
> records to map().
>
> Any thoughts?
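Here is the promised sketch of the byte-range idea, again untested and against the old org.apache.hadoop.mapred API. The side-file format (one "start length" pair per line) and the "index.ranges" property are invented for illustration. Note that a plain FileSplit already carries a start offset and length, so one way to approximate a FileByteRangeSplit is simply to emit FileSplits that cover only the interesting ranges:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ByteRangeInputFormat extends FileInputFormat<LongWritable, Text> {

  // Build one split per interesting byte range listed in the side file,
  // instead of splitting the whole data file.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    Path dataFile = FileInputFormat.getInputPaths(job)[0];
    Path indexFile = new Path(job.get("index.ranges"));  // hypothetical side file
    FileSystem fs = indexFile.getFileSystem(job);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(indexFile)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\\s+");  // "start length"
        long start = Long.parseLong(parts[0]);
        long length = Long.parseLong(parts[1]);
        splits.add(new FileSplit(dataFile, start, length, (String[]) null));
      }
    } finally {
      in.close();
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Each byte-range split is then read like any ordinary text split.
    return new LineRecordReader(job, (FileSplit) split);
  }
}

Since each range becomes its own split, you get one map task per interesting range, which is what recovers the parallelism.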
