On Wed, Feb 4, 2009 at 4:09 PM, Dave Latham <[email protected]> wrote:

> In order to speed up a map reduce job operating on HBase input data, we
> recently added a RowFilter to the input format.  However, when trying to
> execute it, map tasks (one per region) that used to take 1-2 minutes began
> timing out after 10 minutes.  So I dug in to TableInputFormatBase to see how
> it handles a row filter, and it appears to take our filter and combine it
> with a StopRowFilter in order to scan the proper split, since there is no
> getScanner method that can accept both a stop row and a row filter.  Digging
> further in to the scanning / filtering, it looks like it continues scanning
> until filterAllRemaining returns true.  However,
> StopRowFilter.filterAllRemaining() always returns false.  So if my
> understanding is correct, every split in this task will end up scanning to
> the end of the table and testing every row with the filter instead of simply
> stopping at the end of its given split.  That would explain why my map
> tasks began taking longer (instead of shorter).


> 1. Is my understanding correct?  (aka is this a bug?  If so, I don't see an
> existing JIRA issue for it -- I can open one if no one else does.)


Sounds like a bug (and an explanation for long-running jobs) but, IIUC, the stop
row filter is supposed to have a 'stop row' embedded, and once the filter passes
it, we stop filtering?  If that's not going on, let's fix it.
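
To make the suspected behavior concrete, here is a minimal, self-contained
Java sketch.  It does not use the real HBase classes: SimpleRowFilter,
NeverStops, StopsAtBoundary and rowsExamined are hypothetical names made up
for illustration, and the loop is only an assumption about how a scanner
consults filterAllRemaining, based on the description above.  It compares a
stop-row filter whose filterAllRemaining() always returns false with one that
remembers it has passed the stop row.

```java
import java.util.Arrays;
import java.util.List;

public class StopRowSketch {
    /** Hypothetical, simplified stand-in for a row filter; not the real HBase interface. */
    interface SimpleRowFilter {
        boolean filterAllRemaining();         // true means the scanner may stop
        boolean filterRowKey(String rowKey);  // true means this row is filtered out
    }

    /** Mirrors the reported behavior: rejects rows at or past the stop row, but never ends the scan. */
    static class NeverStops implements SimpleRowFilter {
        private final String stopRow;
        NeverStops(String stopRow) { this.stopRow = stopRow; }
        public boolean filterAllRemaining() { return false; }
        public boolean filterRowKey(String rowKey) { return rowKey.compareTo(stopRow) >= 0; }
    }

    /** One possible fix (an assumption): remember that the stop row was reached. */
    static class StopsAtBoundary implements SimpleRowFilter {
        private final String stopRow;
        private boolean passedStopRow = false;
        StopsAtBoundary(String stopRow) { this.stopRow = stopRow; }
        public boolean filterAllRemaining() { return passedStopRow; }
        public boolean filterRowKey(String rowKey) {
            if (rowKey.compareTo(stopRow) >= 0) {
                passedStopRow = true;
            }
            return passedStopRow;
        }
    }

    /** Simplified scan loop: returns how many rows were tested before the scan ended. */
    static int rowsExamined(List<String> sortedRows, SimpleRowFilter filter) {
        int examined = 0;
        for (String row : sortedRows) {
            if (filter.filterAllRemaining()) {
                break;  // the filter says nothing further can match
            }
            examined++;
            filter.filterRowKey(row);
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println(rowsExamined(table, new NeverStops("c")));       // prints 6
        System.out.println(rowsExamined(table, new StopsAtBoundary("c")));  // prints 3
    }
}
```

In this toy model the first filter tests every row to the end of the table,
while the second ends the scan right after the first row at or past the stop
row.  Whether the real StopRowFilter should be fixed this way is for the JIRA
issue to decide.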

St.Ack
P.S. Thanks for digging in.
