Thanks for all the input:

> I think having support for this in some way in the IPC
> protocol makes sense (it seems slightly less important for the C API
> but worth thinking about).

The way I read Jacques's e-mail, the opposite might be true
(at least for Dremio).  For IPC I think there is probably a sweet spot
where it doesn't pay to compact the batches, but it would likely take some
tuning to find.
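
For anyone following along, here is a rough sketch of the two kinds of
selection Jacques describes below (a dense bitmap and a sparse 2-byte index
vector) and of what "compacting" a batch before sending means.  It is purely
illustrative: the function names are made up and are not Arrow APIs, and the
least-significant-bit-first bit order is an assumption chosen to match how
Arrow validity bitmaps are laid out.

```python
from array import array


def pack_bitmap(keep):
    """Pack booleans into a bitmap, least-significant bit first (assumed order)."""
    out = bytearray((len(keep) + 7) // 8)
    for i, flag in enumerate(keep):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def compact_with_bitmap(values, bitmap):
    """Dense form: keep the records whose bit is set; input order is preserved."""
    return [v for i, v in enumerate(values) if (bitmap[i // 8] >> (i % 8)) & 1]


def compact_with_indices(values, indices):
    """Sparse form: the index vector picks the records and also their order."""
    return [values[i] for i in indices]


values = [10, 20, 30, 40, 50, 60]

# One bit per record: records 0, 2, 3 and 5 are selected.
bitmap = pack_bitmap([True, False, True, True, False, True])

# Unsigned 2-byte indices (typecode "H"): selects and reorders records.
indices = array("H", [5, 0, 3])

dense = compact_with_bitmap(values, bitmap)     # [10, 30, 40, 60]
sparse = compact_with_indices(values, indices)  # [60, 10, 40]
```

The open question for IPC is whether the sender should do this compaction
itself or ship the original buffers plus the bitmap/indices and let the
receiver decide.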


> The question is how mechanically, would it be some extra buffers at
> the start or end of the record batch body (probably have to be at the
> end of the body for forward compatibility reasons)?

I think for RecordBatch it would be an extra buffer either at the beginning
or the end.  It's possible that putting it at the end would allow better
forwards compatibility.  I haven't really given much thought to the design
here.  My main concern is to define appropriate metadata before 1.0.0 to
maintain forwards compatibility.  My thinking is the metadata would be an
enum or nullable table whose default value indicates "no filter".
Implementations could then determine, based on the metadata, whether they
know how to interpret the corresponding buffers correctly.
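
To make the idea a bit more concrete, here is a minimal sketch of how a
reader might use such a metadata field.  Everything in it is hypothetical:
the enum name, its members, the "filter" key, and the numeric codes are
placeholders for illustration, not a proposal for what would actually go
into Schema.fbs.

```python
from enum import Enum


class FilterType(Enum):
    """Hypothetical filter kinds; NONE is the default and changes nothing."""
    NONE = 0
    BITMAP = 1
    INDEX16 = 2


# What this particular (hypothetical) reader knows how to interpret.
SUPPORTED = frozenset({FilterType.NONE, FilterType.BITMAP})


def filter_code(metadata):
    """Writers that predate the field omit it, which reads as 0 (NONE)."""
    return metadata.get("filter", 0)


def can_read(code, supported=SUPPORTED):
    """Return True only if this reader understands the declared filter."""
    try:
        kind = FilterType(code)
    except ValueError:
        # Code written by a newer implementation that this reader predates.
        return False
    return kind in supported


old_batch = {}                # no filter field at all
filtered_batch = {"filter": 1}
future_batch = {"filter": 99}

readable = [can_read(filter_code(m))
            for m in (old_batch, filtered_batch, future_batch)]
```

The point is only that a reader can reject (or fall back on) a batch whose
filter it does not recognize instead of silently misreading the extra
buffers.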

I can try to put up a straw-man PR for metadata if we think this is worth
pursuing further.

Thanks,
Micah

P.S. This also raises a slightly related concern about letting applications
negotiate "capabilities" at a finer-grained level (e.g. letting the
transmitter know that the receiver only supports unfiltered values).

On Mon, Jan 27, 2020 at 8:34 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah -- I think having support for this in some way in the IPC
> protocol makes sense (it seems slightly less important for the C API
> but worth thinking about). It's helpful to know that Dremio (a big
> Arrow user) already employs various filters / selection vectors.
>
> The question is how mechanically, would it be some extra buffers at
> the start or end of the record batch body (probably have to be at the
> end of the body for forward compatibility reasons)?
>
> On Sun, Jan 26, 2020 at 1:16 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > At Dremio, we use four main types of selection vector/bitmaps:
> >
> > Dense Format (record valid or not, no ordering)
> > - single bit (bitmap)
> >
> > Sparse formats (identifies valid records as well as their order)
> > - 2 byte (for record batches up to 2^16 records).
> > - 4 byte (for 2^16 batches of 2^16 records);
> > - 6 byte (for 2^32 batches of 2^16 records);
> >
> > We've considered introducing a couple more. I imagine for other use cases,
> > where people use much larger batches of records, different requirements
> > would be necessary. My reason for sharing is it seems like this may be
> > use-case specific. I'd also note that at the IPC level, you'd generally
> > want to contract batches before dropping them on the wire (or at least
> > that is what we typically do).
> >
> > On Fri, Jan 24, 2020 at 11:23 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > I was thinking selection vector/bitmap (possibly with different encodings),
> > > but really nothing for now.  Ordinarily, I'd lean towards YAGNI but there
> > > isn't a good way to add this in easily in a forward compatible way unless
> > > we add a placeholder enum/table for 1.0 (the default option would be no
> > > filter and wouldn't change the packaged data at all).
> > >
> > > On Fri, Jan 24, 2020 at 4:55 AM Francois Saint-Jacques <
> > > fsaintjacq...@gmail.com> wrote:
> > >
> > > > By filter, you mean a filter expression, or a selection vector/bitmap?
> > > >
> > > > On Thu, Jan 23, 2020 at 11:38 PM Micah Kornfield <emkornfi...@gmail.com>
> > > > wrote:
> > > > >
> > > > > One of the things that I think got overlooked in the conversation on
> > > > > having a slice offset in the C API was a suggestion from Jacques of
> > > > > perhaps generalizing the concept to an arbitrary "filter" for
> > > > > arrays/record batches.
> > > > >
> > > > > I believe this point was also discussed in the past as well.  I'm not
> > > > > advocating for adding it now but I'm curious if people feel we should
> > > > > add something to Schema.fbs for forward compatibility, in case we wish
> > > > > to support this use-case in the future.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > >
> > >
>
