Re: [Format] Passing selection masks with Arrow record batches

Francois Saint-Jacques Mon, 28 Jan 2019 06:05:01 -0800

On Mon, Jan 28, 2019 at 12:53 AM Wes McKinney <wesmck...@gmail.com> wrote:


> I was having a discussion recently about Arrow and the topic of
> server-side filtering vs. client-side filtering came up.
>
> The basic problem is this:
>
> If you have a RecordBatch that you wish to filter out some of the
> "rows", one way to track this in-memory is to create a separate array
> of true/false values instead of forcing a materialization of a
> filtered RecordBatch.
>
> So you might have a record batch with 2 fields and 4 rows
>
> a: [1, 2, 3, 4]
> b: ['foo', 'bar', 'baz', 'qux']
>
> and then a filter
>
> is_selected: [true, true, false, true]
>
> This can be easily handled as an application-level concern. Creating a
> bitmask is generally cheap relative to materializing the filtered
> version, and some operators may support "pushing down" such filters
> (e.g. aggregations may accept a selection mask to exclude "off"
> values). I myself implemented such a scheme in the past for a query
> engine I built in the 2013-2014 time frame and it yielded material
> performance improvements in some cases.
>
> One question is what you should do when you want to put the data on
> the wire, e.g. via RPC / Flight or IPC. Two options
>
> * Pass the complete RecordBatch, plus the filter as a "special" field
> and attach some metadata so that you know that filter field is
> "special"
>
> * Filter the RecordBatch before sending, send only the selected rows
>
> The first option can of course be implemented as an application-level
> detail, much as we handle the serialization of pandas row indexes
> right now (where we have custom pandas metadata and "special"
> fields/columns for the index arrays). But it could be a common enough
> use case to merit a more well-defined standardized approach.
>
> I'm not sure what the answer is, but I wanted to describe the problem
> as I see it and see if anyone has any thoughts about it.
>
> I'm aware that Dremio is a user of "selection vectors" (selected
> indices, instead of boolean true/false values, so we would have [0, 1,
> 3] in the above case), so the similar discussion may apply to passing
> selection vectors on the wire.
>

Some remarks:

- The purpose of using selection vector/masks is to exploit late
materialization as long as possible. I think having mask support make sense
in the zero-copy IPC, e.g. same-host or host-to-gpu transfer (no true
zero-copy but... avoid the scatter-copy preceeding the DMA copy).
- Selection vector as opposed the raw bitmaps start making sense when
selectivity is well under 50%. If we add the support of masking in the
protocol, I'd make sure that it would have future compatibility for other
types of compressed bitmap. This could be as simple as a tagged union with
only one type for now.
- I'd also take into account the iterator-composability with the current
null-bitmap. (bitmap & bitmap) is easier to compose than (bitmap &
selection-vector).
- If we _don't_ add masking support to the protocol and still want filter
push-down, we'll need to add a feature which is the dual, a column that
represents the original row id (or the row offset).

Re: [Format] Passing selection masks with Arrow record batches

Reply via email to