Re: Row Filters in TableInputFormatBase

stack Sat, 07 Feb 2009 13:21:46 -0800

On Wed, Feb 4, 2009 at 4:09 PM, Dave Latham <[email protected]> wrote:


> In order to speed up a map reduce job operating on HBase input data, we
> recently added a RowFilter to the input format.  However, when trying to
> execute it, map tasks (one per region) that used to take 1-2 minutes began
> timing out after 10 minutes.  So I dug into TableInputFormatBase to see how
> it handles a row filter, and it appears to take our filter and combine it
> with a StopRowFilter in order to scan the proper split, since there is no
> getScanner method that can accept both a stop row and a row filter.
> Digging further into the scanning / filtering, it looks like it continues
> scanning until filterAllRemaining returns true.  However,
> StopRowFilter.filterAllRemaining() always returns false.  So if my
> understanding is correct, every split in this task will end up scanning to
> the end of the table and testing every row with the filter instead of
> simply stopping at the end of its given split.  That would explain why my
> map tasks began taking longer (instead of shorter).


> 1. Is my understanding correct?  (aka is this a bug?  If so, I don't see an
> existing JIRA issue for it -- I can open one if no one else does.)


Sounds like a bug (and an explanation for the long-running jobs), but, IIUC,
the stop row filter is supposed to have a 'stop row' embedded, and once the
scan passes that row, we stop filtering?  If that's not going on, let's fix it.
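
To make the suspected behavior concrete, here is a minimal, self-contained
sketch. It does not use the real HBase classes: the RowFilter interface, both
filter classes, and the rowsExamined loop are hypothetical stand-ins that only
model the semantics described above (a scan that keeps going until
filterAllRemaining() returns true). It compares a stop-row filter that never
reports completion with one that does.

```java
import java.util.Arrays;
import java.util.List;

public class StopRowSketch {
    /** Hypothetical stand-in for a row filter; NOT the real HBase interface. */
    public interface RowFilter {
        boolean filterRowKey(String rowKey);  // true = exclude this row
        boolean filterAllRemaining();         // true = the scan may stop
    }

    /** Models the suspected bug: excludes rows past the stop row but never says "done". */
    public static class NeverDoneStopFilter implements RowFilter {
        private final String stopRow;
        public NeverDoneStopFilter(String stopRow) { this.stopRow = stopRow; }
        public boolean filterRowKey(String rowKey) { return rowKey.compareTo(stopRow) >= 0; }
        public boolean filterAllRemaining() { return false; }
    }

    /** Models the expected behavior: remembers that the stop row has been reached. */
    public static class TerminatingStopFilter implements RowFilter {
        private final String stopRow;
        private boolean reachedStopRow = false;
        public TerminatingStopFilter(String stopRow) { this.stopRow = stopRow; }
        public boolean filterRowKey(String rowKey) {
            if (rowKey.compareTo(stopRow) >= 0) {
                reachedStopRow = true;
            }
            return reachedStopRow;
        }
        public boolean filterAllRemaining() { return reachedStopRow; }
    }

    /** Simplified scan loop: counts how many rows are examined before the scan ends. */
    public static int rowsExamined(List<String> sortedRows, RowFilter filter) {
        int examined = 0;
        for (String row : sortedRows) {
            if (filter.filterAllRemaining()) {
                break;  // filter says nothing further can match
            }
            examined++;
            filter.filterRowKey(row);
        }
        return examined;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e", "f");
        // The split is meant to end at stop row "c".
        System.out.println(rowsExamined(rows, new NeverDoneStopFilter("c")));    // 6
        System.out.println(rowsExamined(rows, new TerminatingStopFilter("c"))); // 3
    }
}
```

In this toy model the first filter examines every row to the end of the
table, while the second stops one row after reaching the stop row, which is
the difference that would matter for a split covering a single region.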

St.Ack
P.S. Thanks for digging in.
