I have recently opened HBASE-28622
<https://issues.apache.org/jira/browse/HBASE-28622> , which has turned out
to be another aspect of the problem discussed in HBASE-20565
<https://issues.apache.org/jira/browse/HBASE-20565> .

The problem is discussed in detail in HBASE-20565
<https://issues.apache.org/jira/browse/HBASE-20565> , but it boils down to
the API design decision that the filters returning SEEK_NEXT_USING_HINT
rely on filterCell() getting called.

On the other hand, some filters maintain internal per-row state, such as
counters that advance on each filterCell() call, so their behaviour depends
on the results of the preceding filters in a FilterList.

When the filters in a list return different results from filterRowKey(),
any filter that can return SEEK_NEXT_USING_HINT and that returned false must
still have filterCell() called; otherwise the scan degenerates into a full
scan.

Filters that maintain internal row state, in contrast, must only be called
if all previous filters have returned INCLUDE for the Cell; otherwise their
internal state will be off. (This still has caveats, as described in
HBASE-20565 <https://issues.apache.org/jira/browse/HBASE-20565>)

In my opinion, the current code from HBASE-20565
<https://issues.apache.org/jira/browse/HBASE-20565> strikes a bad balance:
while it fixes some use cases for row-stateful filters, it often negates
the performance benefit of the hint-providing filters, which in practice
makes them unusable in many filter list combinations.

Without completely re-designing the filter system, I think that the best
solution would be adding a method to distinguish the filters that can
return hints from the rest of them. (This was also suggested in HBASE-20565
<https://issues.apache.org/jira/browse/HBASE-20565> , but it was not
implemented)

In theory, we have four combinations of hinting and row-stateful filters,
but currently we have no filters that are both hinting and row-stateful,
and I don't think that there is a valid use case for those. The ones that
are neither hinting nor stateful could be handled as either, but treating
them as non-hinting seems faster.

Once we have that, we can improve the filterList behaviour a lot:
- in filterRowKey(), if any hinting filter returns false, then we could
return false
- in filterCell(), rather than returning on the first non-include result,
we could process the remaining hinting filters, while skipping the
non-hinting ones.

The code changes are minimal: we just need to add a new method like
isHinting() to the Filter class, and change the above two methods.

We could add this even in 2.5, by defaulting isHinting() to return false in
the Filter class, which would preserve the current API and behaviour for
existing custom filters.
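
To make the two changes concrete, here is a rough sketch of the AND case.
It is not the real HBase code: ReturnCode, SketchFilter and
AndFilterListSketch are simplified, hypothetical stand-ins (an interface
with a default method instead of the Filter class, Strings instead of
Cells), and the way return codes are merged (a hint overrides an earlier
rejection) is an assumption made only for illustration.

```java
import java.util.List;

// Simplified stand-ins for illustration; these are NOT the real
// org.apache.hadoop.hbase.filter types.
enum ReturnCode { INCLUDE, SKIP, NEXT_ROW, SEEK_NEXT_USING_HINT }

interface SketchFilter {
    // As in HBase, returning true means "filter this row out".
    boolean filterRowKey(String rowKey);

    ReturnCode filterCell(String cell);

    // Proposed method. Defaulting to false keeps existing filters unchanged.
    default boolean isHinting() {
        return false;
    }
}

// Hypothetical AND-style filter list applying the two proposed rules.
final class AndFilterListSketch {
    private final List<SketchFilter> filters;

    AndFilterListSketch(List<SketchFilter> filters) {
        this.filters = filters;
    }

    // Rule 1: if any hinting filter returns false, return false, so that
    // filterCell() is still called and the hint can be produced.
    boolean filterRowKey(String rowKey) {
        boolean anyFilteredOut = false;
        boolean hintingNeedsCells = false;
        for (SketchFilter f : filters) {
            if (f.filterRowKey(rowKey)) {
                anyFilteredOut = true;
            } else if (f.isHinting()) {
                hintingNeedsCells = true;
            }
        }
        return anyFilteredOut && !hintingNeedsCells;
    }

    // Rule 2: after the first non-INCLUDE result, keep consulting hinting
    // filters, but skip non-hinting ones so their row state is not touched.
    ReturnCode filterCell(String cell) {
        ReturnCode result = ReturnCode.INCLUDE;
        for (SketchFilter f : filters) {
            boolean rejected = result != ReturnCode.INCLUDE;
            if (rejected && !f.isHinting()) {
                continue;
            }
            ReturnCode rc = f.filterCell(cell);
            // Assumed merge rule: a hint always wins; otherwise the first
            // non-INCLUDE result is kept.
            if (rc == ReturnCode.SEEK_NEXT_USING_HINT || !rejected) {
                result = rc;
            }
        }
        return result;
    }
}
```

With a list such as [rejecting filter, row-stateful counter, hinting
filter], the counter's filterCell() is never invoked once the first filter
has rejected the cell, while the hinting filter still gets to return its
hint.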

I was looking at it from the AND filter perspective, but if needed, similar
changes could be made to the OR filter.

What do you think?
Is this a good idea?

Istvan
