In order to speed up a MapReduce job operating on HBase input data, we recently added a RowFilter to the input format. However, when we tried to execute it, map tasks (one per region) that used to take 1-2 minutes began timing out after 10 minutes.

So I dug into TableInputFormatBase to see how it handles a row filter, and it appears to take our filter and combine it with a StopRowFilter in order to scan the proper split, since there is no getScanner method that accepts both a stop row and a row filter. Digging further into the scanning / filtering code, it looks like the scanner keeps going until filterAllRemaining() returns true. However, StopRowFilter.filterAllRemaining() always returns false. So if my understanding is correct, every map task will end up scanning to the end of the table and testing every row with the filter, instead of simply stopping at the end of its assigned split. That would explain why my map tasks got slower instead of faster.
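To illustrate the behavior I think I'm seeing, here is a minimal sketch. These are simplified stand-ins I wrote for this example, not the actual HBase classes, but they mimic the filterAllRemaining() contract as I understand it from the 0.19 source:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the HBase 0.19 RowFilterInterface contract
// (hypothetical, for illustration only).
interface RowFilter {
    boolean filterRowKey(String rowKey);   // true = reject this row
    boolean filterAllRemaining();          // true = scanner may stop entirely
}

// Mimics StopRowFilter: rejects rows >= stopRow, but filterAllRemaining()
// always returns false, so the scan never short-circuits.
class MyStopRowFilter implements RowFilter {
    private final String stopRow;
    MyStopRowFilter(String stopRow) { this.stopRow = stopRow; }
    public boolean filterRowKey(String rowKey) { return rowKey.compareTo(stopRow) >= 0; }
    public boolean filterAllRemaining() { return false; }
}

// Mimics WhileMatchRowFilter: once the wrapped filter rejects a row,
// filterAllRemaining() latches to true and the scan can stop.
class MyWhileMatchRowFilter implements RowFilter {
    private final RowFilter inner;
    private boolean done = false;
    MyWhileMatchRowFilter(RowFilter inner) { this.inner = inner; }
    public boolean filterRowKey(String rowKey) {
        if (inner.filterRowKey(rowKey)) done = true;
        return done;
    }
    public boolean filterAllRemaining() { return done; }
}

public class ScanSketch {
    // Count how many rows a scan touches before filterAllRemaining() stops it.
    static int rowsTouched(List<String> table, RowFilter f) {
        int touched = 0;
        for (String row : table) {
            if (f.filterAllRemaining()) break;
            touched++;
            f.filterRowKey(row);
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("a", "b", "c", "d", "e", "f");
        // Split ends at "c": the bare stop filter touches every row...
        System.out.println(rowsTouched(table, new MyStopRowFilter("c")));         // 6
        // ...while the WhileMatch wrapper stops right past the split boundary.
        System.out.println(rowsTouched(table,
            new MyWhileMatchRowFilter(new MyStopRowFilter("c"))));                // 3
    }
}
```

If this model is right, the bare StopRowFilter scans all 6 rows of the toy table, while the wrapped version stops after 3, which matches the slowdown we're seeing per split.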
1. Is my understanding correct? (i.e., is this a bug? If so, I don't see an existing JIRA issue for it -- I can open one if no one else does.)

2. If so, should StopRowFilter.filterAllRemaining() return true once the stop row has been reached? Or should TableInputFormatBase wrap it in a WhileMatchRowFilter for the same effect?

3. Is there a reason why HTable does not support requesting a scanner with both an end row and a row filter, forcing all clients to add these extra filters?

Thanks!
Dave

Hadoop / HBase 0.19.0
