Re: Behaviour of filters within scans

Juhani Connolly Sun, 18 Apr 2010 22:28:58 -0700

Thanks for your response

On 04/19/2010 12:59 PM, Ryan Rawson wrote:

I think all the functionality is there between these 2 calls:


Filter#filterKeyValue(KeyValue kv);
and
Filter#filterRow();

In the first call you can cache the KeyValues locally in the filter
state (in a List<KeyValue>  for example).  In the last call you can do
your custom logic based on all the KeyValues you have seen.  There is
little to no cost to do this, since retaining references to a KeyValue
is cheap (ish, relatively, etc).
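A minimal sketch of that pattern, assuming stand-in types rather than the real HBase classes: SketchCell and RowBufferingFilter are hypothetical names, the methods only mirror the shape of Filter#filterKeyValue, Filter#filterRow and Filter#reset, and the "keep the row only if col-a holds ok" rule is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for HBase's KeyValue, reduced to what the sketch needs (hypothetical).
final class SketchCell {
    final String column;
    final long timestamp;
    final String value;

    SketchCell(String column, long timestamp, String value) {
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

// Mirrors the shape of the Filter calls; it does not implement the real interface.
class RowBufferingFilter {
    private final List<SketchCell> seen = new ArrayList<>();

    // Analogue of Filter#filterKeyValue: keep a reference to every cell of the row.
    void filterKeyValue(SketchCell kv) {
        seen.add(kv);
    }

    // Analogue of Filter#filterRow: returning true drops the whole row.
    // Hypothetical rule: keep the row only if some col-a cell holds "ok".
    boolean filterRow() {
        for (SketchCell c : seen) {
            if (c.column.equals("col-a") && c.value.equals("ok")) {
                return false;
            }
        }
        return true;
    }

    // Analogue of Filter#reset: clear per-row state before the next row starts.
    void reset() {
        seen.clear();
    }
}
```

Note that the final decision here is all-or-nothing for the row: the cached cells inform the choice, but the call itself only keeps or drops the row as a whole.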

But ultimately the only thing I can do with Filter#filterRow() is drop
the full row? Am I missing something here? Were I to store references to
all the key values that have passed through, at most I could zero out
their buffers in the #filterRow call? I'm not sure what the consequences
of this might be afterwards as the scanner tries to send a load of empty
cells. Looking at HRegionServer#next(final long scannerId, int nbRows),
it seems to me that they would get packed into Result to get sent back
to the client. I could certainly cut down on a lot of transfer by just
sending "empty" keyvalues, but it still seems like a lot of overhead
that could be lost by a small api change. Or am I missing something here?

The filter implementation has changed a bit since August 2009, and it
might be possible to create a call like
Filter#filterRow(List<KeyValue>  results) that is called at the "end"
of a row... you can get the same effect as I noted above.  It is just
a matter of API, not of semantics.

Having followed the code, it did seem like it would be trivial to
implement such an extra api either before or after the
Filter#filterRow(). I believe the option of having the ability to knock
keyvals out of the list would save on processing later.
I would be happy to try putting together the minor modification to
RegionScanner and adding a unit test if such a modification were welcome.

I would generally discourage you from structuring your data to fit an
internal implementation detail.  While there are no current plans to
change sorting order, it would make your code more brittle.

I certainly wouldn't want to do it :) I'm going to have to see how much
overhead I get with a) just dealing with it client end or b) keeping
references and zeroing the keyvals and go from there.

-ryan

On Sun, Apr 18, 2010 at 8:48 PM, Juhani Connolly<juh...@ninja.co.jp>  wrote:

I've spent some time looking through the regionscanner logic, in particular
the filter related parts and would want to check if a) my current
understanding is correct and b) if this may be subject to change.

short/simplified version to avoid getting sidetracked:
- A RegionScanner is built from a series of scanners attached to each Store.
- This list of scanners is stored in a KeyValueHeap which compares KeyValues
to sort the order in which entries are retrieved by RegionScanner->next
- To check the order in which keys will be returned, and thus filtered one
can look at KeyValue.KeyComparator->compare. It's something like: sort by
row, then column family, then column, then timestamp
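The ordering described in the list above can be sketched roughly as follows. SketchKey is a hypothetical stand-in that uses strings where HBase compares byte arrays, and the newest-first timestamp direction is an assumption about HBase's behaviour rather than something stated in this thread.

```java
// Hypothetical stand-in for the key portion of a KeyValue.
final class SketchKey {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    SketchKey(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    // Compare by row, then column family, then column, then timestamp.
    static int compare(SketchKey a, SketchKey b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) {
            return c;
        }
        c = a.family.compareTo(b.family);
        if (c != 0) {
            return c;
        }
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) {
            return c;
        }
        // Assumption: newer timestamps sort first within a column.
        return Long.compare(b.timestamp, a.timestamp);
    }
}
```

Sorting a list with `keys.sort(SketchKey::compare)` then yields all of a row's col-a cells before its col-b cells within the same family, which is the property the question below relies on.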

Filters are applied as described in
http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/filter/Filter.html

In the end, when using filterKeyValue(KeyValue) one can expect the keyValues
to be sent to it in a sorted order. Will this always be the case?

I ask this because I currently plan to filter the values of col-b based on
the values in col-a. This could be achieved by making sure col-a compares
lower than col-b and storing some kind of data (e.g. a list of "ok"
timestamps) within the custom filter. Does this all sound ok?
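As a rough sketch of that plan, assuming col-a cells always reach the filter before col-b cells of the same row: DepCell, ColumnDependencyFilter and the Decision enum are hypothetical stand-ins for KeyValue, a custom Filter and its return code, and the "value equals ok" condition is invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a KeyValue.
final class DepCell {
    final String column;
    final long timestamp;
    final String value;

    DepCell(String column, long timestamp, String value) {
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

// Order-dependent filter: only correct if col-a is seen before col-b.
class ColumnDependencyFilter {
    // Stand-in for the filter's return code.
    enum Decision { INCLUDE, SKIP }

    private final Set<Long> okTimestamps = new HashSet<>();

    // Analogue of Filter#filterKeyValue.
    Decision filterKeyValue(DepCell kv) {
        if (kv.column.equals("col-a")) {
            // Remember the timestamps of col-a cells that meet the condition.
            if (kv.value.equals("ok")) {
                okTimestamps.add(kv.timestamp);
            }
            return Decision.INCLUDE;
        }
        if (kv.column.equals("col-b")) {
            // Keep col-b only where a matching col-a timestamp was recorded.
            return okTimestamps.contains(kv.timestamp) ? Decision.INCLUDE : Decision.SKIP;
        }
        return Decision.INCLUDE;
    }

    // Analogue of Filter#reset: forget the timestamps before the next row.
    void reset() {
        okTimestamps.clear();
    }
}
```

If a col-b cell ever arrived ahead of its col-a counterpart, this sketch would skip it wrongly, which is why the approach depends on the sort order.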

Finally it would be nice to see the option to filter a full set, as naming
columns to guarantee a certain sorting for filters seems pretty dubious:
- Probably in HRegion.Regionserver->next after nextInternal, before
filterRow?
- This would allow a potential filter to go through the gathered results and
prune them depending on intercolumn dependencies?
- I believe it would unlock a lot of possibilities for custom filters that
could cut down on a significant amount of transfer where a row's data could be
pruned regionserver side rather than at the client. My particular
application is to only store col-b where there is a col-a with a
corresponding timestamp that matches specific conditions. In my particular
case this results in massive reductions in the amount of cells being sent
from the regionserver.
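To illustrate the kind of hook meant here, a sketch of a whole-row pruning call under stated assumptions: no such call exists in the Filter interface discussed in this thread, and RowCell, WholeRowPruner and the "value equals ok" condition are hypothetical. Because the full list is available, the result does not depend on the order in which the columns were gathered.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a KeyValue.
final class RowCell {
    final String column;
    final long timestamp;
    final String value;

    RowCell(String column, long timestamp, String value) {
        this.column = column;
        this.timestamp = timestamp;
        this.value = value;
    }
}

class WholeRowPruner {
    // Hypothetical hook, called once with every cell gathered for a row.
    // It removes col-b cells that have no col-a cell with value "ok"
    // at the same timestamp; the list must be mutable.
    static void filterRow(List<RowCell> results) {
        Set<Long> okTimestamps = new HashSet<>();
        for (RowCell c : results) {
            if (c.column.equals("col-a") && c.value.equals("ok")) {
                okTimestamps.add(c.timestamp);
            }
        }
        results.removeIf(c -> c.column.equals("col-b")
                && !okTimestamps.contains(c.timestamp));
    }
}
```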

Any thoughts would be appreciated.

As an aside, I believe HRegion.RegionScanner->nextInternal is doing
filterRowKey for every key in a row even if it has passed once? Is this
intentional behaviour (it seems somewhat unexpected), as otherwise it could
be optimised by just checking the samerow variable.
