On Wed, Feb 4, 2009 at 4:09 PM, Dave Latham <[email protected]> wrote:
> In order to speed up a map reduce job operating on HBase input data, we
> recently added a RowFilter to the input format.  However, when trying to
> execute it, map tasks (one per region) that used to take 1-2 minutes began
> timing out after 10 minutes.  So I dug in to TableInputFormatBase to see how
> it handles a row filter, and it appears to take our filter and combine it
> with a StopRowFilter in order to scan the proper split, since there is no
> getScanner method that can accept both a stop row and a row filter.  Digging
> further in to the scanning / filtering, it looks like it continues scanning
> until filterAllRemaining returns true.  However,
> StopRowFilter.filterAllRemaining() always returns false.  So if my
> understanding is correct, every split in this task will end up scanning to
> the end of the table and testing every row with the filter instead of simply
> stopping at the end of its given split.  That would explain why my map
> tasks began taking longer (instead of shorter).
>
> 1. Is my understanding correct?  (aka is this a bug?  If so, I don't see an
> existing JIRA issue for it -- I can open one if no one else does.)

Sounds like a bug (and an explanation for long-running jobs), but IIUC the
stop row filter is supposed to have a 'stop row' embedded, and once the
filter passes it, we stop filtering?  If that's not going on, let's fix it.

St.Ack

P.S. Thanks for digging in.
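
To make the suspected behaviour concrete, here is a rough, self-contained sketch. It does not use the real HBase classes: SimpleRowFilter, NeverEndingStopFilter, EndingStopFilter and rowsExamined are all hypothetical stand-ins, and the loop only models the "keep scanning until filterAllRemaining returns true" behaviour described above. It compares a stop filter that never reports the end of the scan with one that remembers it has reached the stop row.

```java
import java.util.Arrays;
import java.util.List;

public class StopRowSketch {

    /** Hypothetical, simplified stand-in for a row filter (not the HBase interface). */
    interface SimpleRowFilter {
        /** True once no further row can match, so the scan may end. */
        boolean filterAllRemaining();

        /** True if the given row should be skipped. */
        boolean filterRowKey(String rowKey);
    }

    /** Models the reported behaviour: rows at or past the stop row are skipped,
     *  but the scan is never told that it can end. */
    static class NeverEndingStopFilter implements SimpleRowFilter {
        private final String stopRow;

        NeverEndingStopFilter(String stopRow) { this.stopRow = stopRow; }

        public boolean filterAllRemaining() { return false; }

        public boolean filterRowKey(String rowKey) {
            return rowKey.compareTo(stopRow) >= 0;
        }
    }

    /** One possible fix (an assumption, not the actual patch): remember that
     *  the stop row was reached and report it through filterAllRemaining. */
    static class EndingStopFilter implements SimpleRowFilter {
        private final String stopRow;
        private boolean passedStopRow = false;

        EndingStopFilter(String stopRow) { this.stopRow = stopRow; }

        public boolean filterAllRemaining() { return passedStopRow; }

        public boolean filterRowKey(String rowKey) {
            if (rowKey.compareTo(stopRow) >= 0) {
                passedStopRow = true;
            }
            return passedStopRow;
        }
    }

    /** Counts how many rows a scan tests before the filter lets it end. */
    static int rowsExamined(List<String> sortedRows, SimpleRowFilter filter) {
        int examined = 0;
        for (String row : sortedRows) {
            if (filter.filterAllRemaining()) {
                break; // the filter says nothing further can match
            }
            examined++;
            filter.filterRowKey(row);
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e", "f");
        // The split is meant to end at stop row "c".
        System.out.println(rowsExamined(rows, new NeverEndingStopFilter("c"))); // prints 6
        System.out.println(rowsExamined(rows, new EndingStopFilter("c")));      // prints 3
    }
}
```

In this toy model the first filter tests every row in the table, while the second stops one row after reaching the stop row, which is the difference between scanning to the end of the table and stopping at the end of the split.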
