MapFiles provide a cheap index; there was a discussion about this on the list
within the last week, I think. You would still have to write your own
RecordReader, I would think, to avoid processing splits that fall below the
starting location, where the starting location is determined by an index lookup.
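Something like this, roughly, for the lookup step (0.16-era API; assumes the
data is stored as a MapFile keyed by the value you're filtering on, and the
path and threshold are just placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class StartLookup {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // placeholder path: a MapFile directory (data + index files)
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/sorted", conf);
    try {
      LongWritable threshold = new LongWritable(Long.parseLong(args[0]));
      Text val = new Text();
      // getClosest() uses the in-memory index to position the reader at the
      // first entry whose key is >= threshold, without scanning the file.
      WritableComparable start = reader.getClosest(threshold, val);
      System.out.println("first qualifying key: " + start);
    } finally {
      reader.close();
    }
  }
}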

Writing your own splitter would be even better: with the RecordReader option,
all the map tasks still get scheduled (although many may terminate
immediately), whereas with the splitter you can construct splits that lie only
in the interesting interval.
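For example (old mapred API; the "filter.start.offset" conf key is made up -
it would hold the byte offset of the first qualifying record, found via the
index lookup above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class FilteringInputFormat extends TextInputFormat {
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    long startOffset = job.getLong("filter.start.offset", 0L);
    List<InputSplit> interesting = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job, numSplits)) {
      FileSplit fileSplit = (FileSplit) split;
      // keep only splits that overlap the interesting interval; splits that
      // end before the start offset never get a map task scheduled at all
      if (fileSplit.getStart() + fileSplit.getLength() >= startOffset) {
        interesting.add(split);
      }
    }
    return interesting.toArray(new InputSplit[interesting.size()]);
  }
}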

If you don't have an index, you can still do this relatively cheaply: in the
RecordReader, read the first value from the next split and decide from that
whether this map task is interesting. Since the file is sorted in ascending
order, if even that value is below the threshold, the whole current split can
be skipped.
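A rough sketch of that peek (again the old mapred API; parseValue() and the
"filter.min.value" conf key are placeholders for however your records encode
the value):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class PeekingRecordReader implements RecordReader<LongWritable, Text> {
  private final LineRecordReader reader;
  private final boolean interesting;

  public PeekingRecordReader(JobConf job, FileSplit split) throws IOException {
    long minValue = job.getLong("filter.min.value", Long.MIN_VALUE);
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(job);
    long fileEnd = fs.getFileStatus(path).getLen();
    long nextStart = split.getStart() + split.getLength();
    if (nextStart < fileEnd) {
      // peek at the first record that belongs to the next split
      LineRecordReader peek = new LineRecordReader(job,
          new FileSplit(path, nextStart, fileEnd - nextStart, (String[]) null));
      LongWritable k = peek.createKey();
      Text v = peek.createValue();
      // ascending order: if the record after our split is still below the
      // threshold, then everything inside our split is below it too
      interesting = !peek.next(k, v) || parseValue(v) >= minValue;
      peek.close();
    } else {
      interesting = true; // last split: nothing to peek at, so just scan it
    }
    reader = new LineRecordReader(job, split);
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    return interesting && reader.next(key, value);
  }

  private static long parseValue(Text line) {
    // placeholder: assumes the value is the second tab-separated field
    return Long.parseLong(line.toString().split("\t")[1]);
  }

  public LongWritable createKey() { return reader.createKey(); }
  public Text createValue() { return reader.createValue(); }
  public long getPos() throws IOException { return reader.getPos(); }
  public float getProgress() throws IOException { return reader.getProgress(); }
  public void close() throws IOException { reader.close(); }
}

Note that the split straddling the boundary will still feed some
below-threshold records to map(), so you would still filter inside next() or
in map() itself.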


-----Original Message-----
From: Andy Pavlo [mailto:[EMAIL PROTECTED]
Sent: Tue 3/4/2008 9:11 PM
To: [email protected]
Subject: Using Sorted Files For Filtering Input (File Index)
 
Let's say I have a simple data file with <key, value> pairs and the entire
file is in ascending sorted order by 'value'. What I want to be able to do is
filter the data so that the map function is only invoked with <key, value>
pairs where 'value' is greater than some input value.

Does such a feature already exist, or would I need to implement my own
RecordReader to do this filtering? Is this the right place to do it in
Hadoop's input pipeline?

What I essentially want is a cheap index. By sorting the values ahead of time,
you could just do a binary search on the InputSplits until you found the
starting value that satisfies the predicate. The RecordReader would then
start at this point in the file, read all the lines in, and pass the records
to map().

Any thoughts?
-- 
Andy Pavlo
[EMAIL PROTECTED]
