I was having a discussion recently about Arrow and the topic of
server-side filtering vs. client-side filtering came up.

The basic problem is this:

If you have a RecordBatch that you wish to filter out some of the
"rows", one way to track this in-memory is to create a separate array
of true/false values instead of forcing a materialization of a
filtered RecordBatch.

So you might have a record batch with 2 fields and 4 rows

a: [1, 2, 3, 4]
b: ['foo', 'bar', 'baz', 'qux']

and then a filter

is_selected: [true, true, false, true]

This can be easily handled as an application-level concern. Creating a
bitmask is generally cheap relative to materializing the filtered
version, and some operators may support "pushing down" such filters
(e.g. aggregations may accept a selection mask to exclude "off"
values). I myself implemented such a scheme in the past for a query
engine I built in the 2013-2014 time frame and it yielded material
performance improvements in some cases.

One question is what you should do when you want to put the data on
the wire, e.g. via RPC / Flight or IPC. Two options

* Pass the complete RecordBatch, plus the filter as a "special" field
and attach some metadata so that you know that filter field is
"special"

* Filter the RecordBatch before sending, send only the selected rows

The first option can of course be implemented as an application-level
detail, much as we handle the serialization of pandas row indexes
right now (where we have custom pandas metadata and "special"
fields/columns for the index arrays). But it could be a common enough
use case to merit a more well-defined standardized approach.

I'm not sure what the answer is, but I wanted to describe the problem
as I see it and see if anyone has any thoughts about it.

I'm aware that Dremio is a user of "selection vectors" (selected
indices, instead of boolean true/false values, so we would have [0, 1,
3] in the above case), so the similar discussion may apply to passing
selection vectors on the wire.

- Wes

Reply via email to