In order to speed up a MapReduce job operating on HBase input data, we recently added a RowFilter to the input format. However, when we tried to execute it, map tasks (one per region) that used to take 1-2 minutes began timing out after 10 minutes.

So I dug into TableInputFormatBase to see how it handles a row filter, and it appears to take our filter and combine it with a StopRowFilter in order to scan the proper split, since there is no getScanner method that accepts both a stop row and a row filter. Digging further into the scanning / filtering code, it looks like the scanner keeps going until filterAllRemaining() returns true. However, StopRowFilter.filterAllRemaining() always returns false. So if my understanding is correct, every map task will end up scanning to the end of the table and testing every row with the filter, instead of simply stopping at the end of its assigned split. That would explain why my map tasks got slower instead of faster.
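To illustrate the behavior I think I'm seeing, here is a minimal sketch. These are simplified stand-ins I wrote for this example, not the actual HBase classes, but they mimic the filterAllRemaining() contract as I understand it from the 0.19 source:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the HBase 0.19 RowFilterInterface contract
// (hypothetical, for illustration only).
interface RowFilter {
    boolean filterRowKey(String rowKey);   // true = reject this row
    boolean filterAllRemaining();          // true = scanner may stop entirely
}

// Mimics StopRowFilter: rejects rows >= stopRow, but filterAllRemaining()
// always returns false, so the scan never short-circuits.
class MyStopRowFilter implements RowFilter {
    private final String stopRow;
    MyStopRowFilter(String stopRow) { this.stopRow = stopRow; }
    public boolean filterRowKey(String rowKey) { return rowKey.compareTo(stopRow) >= 0; }
    public boolean filterAllRemaining() { return false; }
}

// Mimics WhileMatchRowFilter: once the wrapped filter rejects a row,
// filterAllRemaining() latches to true and the scan can stop.
class MyWhileMatchRowFilter implements RowFilter {
    private final RowFilter inner;
    private boolean done = false;
    MyWhileMatchRowFilter(RowFilter inner) { this.inner = inner; }
    public boolean filterRowKey(String rowKey) {
        if (inner.filterRowKey(rowKey)) done = true;
        return done;
    }
    public boolean filterAllRemaining() { return done; }
}

public class ScanSketch {
    // Count how many rows a scan touches before filterAllRemaining() stops it.
    static int rowsTouched(List<String> table, RowFilter f) {
        int touched = 0;
        for (String row : table) {
            if (f.filterAllRemaining()) break;
            touched++;
            f.filterRowKey(row);
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("a", "b", "c", "d", "e", "f");
        // Split ends at "c": the bare stop filter touches every row...
        System.out.println(rowsTouched(table, new MyStopRowFilter("c")));         // 6
        // ...while the WhileMatch wrapper stops right past the split boundary.
        System.out.println(rowsTouched(table,
            new MyWhileMatchRowFilter(new MyStopRowFilter("c"))));                // 3
    }
}
```

If this model is right, the bare StopRowFilter scans all 6 rows of the toy table, while the wrapped version stops after 3, which matches the slowdown we're seeing per split.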
1. Is my understanding correct? (i.e., is this a bug? If so, I don't see an existing JIRA issue for it -- I can open one if no one else does.)

2. If so, should StopRowFilter.filterAllRemaining() return true once the stop row has been reached? Or should TableInputFormatBase wrap it in a WhileMatchRowFilter for the same effect?

3. Is there a reason why HTable does not support requesting a scanner with both an end row and a row filter, forcing all clients to add these extra filters?

Thanks!
Dave

Hadoop / HBase 0.19.0
