Let's say I have a simple data file of <key, value> pairs, and the entire file is sorted in ascending order by 'value'. What I want to do is filter the data so that the map function is only invoked with <key, value> pairs whose 'value' is greater than some input value.
Does such a feature already exist, or would I need to implement my own RecordReader to do this filtering? Is that the right place to do this in Hadoop's input pipeline? What I essentially want is a cheap index. Because the values are sorted ahead of time, you could just do a binary search within the InputSplit until you find the first record that satisfies the predicate. The RecordReader would then start at that point in the file, read in all the remaining lines, and pass the records to map(). Any thoughts?
-- Andy Pavlo [EMAIL PROTECTED]
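
To make the idea concrete, here is a rough sketch of just the seek step, written as plain Java with no Hadoop classes involved. It assumes one record per line in the form key<TAB>value, with 'value' an integer, and it models the split as a byte range [start, end) over a ByteBuffer. The class name SortedSplitSeek and the method firstOffsetAbove are made up for illustration and are not part of any Hadoop API. Since a probe offset usually lands in the middle of a line, each probe is first moved forward to the next line start before its value is compared.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: binary search over byte offsets of a value-sorted text split. */
final class SortedSplitSeek {

    /**
     * Returns the offset of the first line that starts in [start, end) and whose
     * value is greater than threshold, or end if there is no such line.
     * Assumes lines look like "key\tvalue\n" and are sorted ascending by value.
     */
    static int firstOffsetAbove(ByteBuffer data, int start, int end, long threshold) {
        int lo = start;    // every line starting before lo is known to fail the predicate
        int hi = end;      // no unexamined line starts in [hi, result)
        int result = end;  // smallest line start known to satisfy the predicate
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            int lineStart = lineStartAtOrAfter(data, mid, hi);
            if (lineStart >= hi) {
                hi = mid;                       // no line begins in [mid, hi)
            } else if (valueAt(data, lineStart) > threshold) {
                result = lineStart;             // candidate found; keep looking to the left
                hi = mid;
            } else {
                lo = nextLineStart(data, lineStart); // this line and all earlier ones fail
            }
        }
        return result;
    }

    /** Smallest line start p with pos <= p < bound, or bound if there is none. */
    private static int lineStartAtOrAfter(ByteBuffer data, int pos, int bound) {
        int p = pos;
        while (p < bound && p > 0 && data.get(p - 1) != '\n') {
            p++;
        }
        return p;
    }

    /** Offset just past the newline that ends the line beginning at lineStart. */
    private static int nextLineStart(ByteBuffer data, int lineStart) {
        int p = lineStart;
        while (p < data.limit() && data.get(p) != '\n') {
            p++;
        }
        return Math.min(p + 1, data.limit());
    }

    /** Parses the integer after the first tab of the line beginning at lineStart. */
    private static long valueAt(ByteBuffer data, int lineStart) {
        int eol = lineStart;
        while (eol < data.limit() && data.get(eol) != '\n') {
            eol++;
        }
        byte[] line = new byte[eol - lineStart];
        for (int i = 0; i < line.length; i++) {
            line[i] = data.get(lineStart + i);
        }
        String text = new String(line, StandardCharsets.UTF_8);
        return Long.parseLong(text.substring(text.indexOf('\t') + 1).trim());
    }
}
```

For example, calling SortedSplitSeek.firstOffsetAbove(buffer, splitStart, splitEnd, 100) would give the offset at which a custom reader could begin emitting records. A real RecordReader would presumably do the same thing with seeks on the underlying file stream and long offsets rather than an in-memory buffer, and it would still have to follow whatever convention the job uses for lines that straddle split boundaries.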
