I've written up the ColumnBag proposal addressing items 1 and 2 on the
list. I'm open to any and all feedback/suggestions.

I'd be happy to add item 3 (binary metadata) to the proposed change set.
Let me know if you want me to whip up the initial suggestion for that
version (and whether or not to keep it separate from ColumnBag).

Would RLE-related efforts change the structure of RecordBatch or ColumnBag
(if accepted)?

Here is a brief history of the discussion around why ColumnBag:
https://docs.google.com/document/d/1jsmmqLTyJkU8fx0sUGIqd6yu72N4v9uHFsuGSgB_DfE/

Here is a small commit that modifies the Flatbuffers schema to support this
version of the proposed change:
https://github.com/nbauernfeind/arrow/tree/column_bag_demo_v1
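
For anyone who has not opened the document yet, here is a rough sketch of
the "bag of columns" idea behind items 1 and 2: columns may differ in length,
and absent top-level fields are skipped via a bitset. It is illustrative
Python only. The names `ColumnBag`, `make_bag`, `presence`, and
`field_lengths` are hypothetical and are not taken from the Flatbuffers
changes in the branch above or from any Arrow API.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ColumnBag:
    """Hypothetical in-memory model of a 'bag of columns' (not an Arrow API)."""
    num_fields: int          # number of top-level fields in the schema
    presence: int            # bitset: bit i set => schema field i is included
    columns: List[Sequence]  # included columns only, in schema order

    def field_lengths(self) -> Dict[int, int]:
        """Map schema field index -> column length; absent fields are omitted."""
        lengths: Dict[int, int] = {}
        included = iter(self.columns)
        for i in range(self.num_fields):
            if (self.presence >> i) & 1:
                lengths[i] = len(next(included))
        return lengths


def make_bag(num_fields: int, included: Dict[int, Sequence]) -> ColumnBag:
    """Build a bag from {schema field index: column values}."""
    presence = 0
    columns: List[Sequence] = []
    for i in sorted(included):
        if not 0 <= i < num_fields:
            raise ValueError(f"field index {i} is outside the schema")
        presence |= 1 << i
        columns.append(included[i])
    return ColumnBag(num_fields, presence, columns)


# A 4-field schema where only fields 0 and 2 changed, with different lengths.
bag = make_bag(4, {0: [1, 2, 3], 2: ["a"]})
print(bin(bag.presence), bag.field_lengths())
```

The point of the bitset is that a skipped field costs one bit rather than a
full field node, and each included column carries its own length instead of
sharing a single batch-wide length.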

I don't know if it's better to comment in the document or bring comments
back to the list. If it ends up being document heavy, then I'll summarize
the main points back on the list.

I think I'll get started on a Java impl just to learn more even if it ends
up being extra work.

Looking forward to your feedback,
Nate

On Mon, Aug 9, 2021 at 10:06 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I'm still interested in RLE related effort, but not sure about my available
> bandwidth (which is why I haven't made more of an effort there).
>
> On Tue, Aug 3, 2021 at 6:00 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Another Flatbuffers/Message.fbs project we should rekindle soon, in
> > addition to the schema evolution/replacement question which has been
> > raised with Flight, is that of sparse/compressed data (e.g. RLE). I
> > have a vacation plus some travel coming up so won't be able to devote
> > meaningful attention to this until the last part of August, but would
> > like to help it move forward.
> >
> >
> > On Tue, Jul 27, 2021 at 1:40 PM David Li <lidav...@apache.org> wrote:
> > >
> > > Hey Nate,
> > >
> > > > For the first two points, semantically I'm tempted to think of it more
> > > > like the ability to send a "bag of columns" according to some schema (and
> > > > hence columns could have differing lengths or even be absent). This could
> > > > be a new structure alongside a record batch, which is semantically like a
> > > > "slice of a table" (and hence rectangular and complete), instead of
> > > > exposing existing users of RecordBatch to rather different behavior.
> > >
> > > > For #3, a different thread was discussing some of the points there - it
> > > > sounds like it may be possible to relax from map<string, string> to
> > > > map<string, binary>.
> > >
> > > -David
> > >
> > > On Mon, Jul 26, 2021, at 11:01, Nate Bauernfeind wrote:
> > > > > Wes suggested that maybe there are enough new ideas that it may make
> > > > > sense to evolve-past the existing structures rather than to bolt-on new
> > > > > functionality. I would like to learn what requirements exist should new
> > > > > structures be adopted, and if applicable, would like to turn this into a
> > > > > full POC proposal.
> > > >
> > > > > These are the features that I feel are missing from the existing design:
> > > > > - the ability to notify that the columns are not consistent in length
> > > > > (e.g. setting RecordBatch.length to -1; and give the arrow/flight user
> > > > > the true FieldNode lengths).
> > > > > - the ability to skip top-level field nodes that have length 0 at a
> > > > > small cost (such as in a bitset)
> > > > > - the ability to embed binary payload in the Message flatbuffer wrapper
> > > > > (instead of String payload only)
> > > > > - the ability to concurrently use more than one schema (the most likely
> > > > > API will look like how one identifies a dictionary. ideally dictionaries
> > > > > could be shared across field nodes in a schema or across schemas in the
> > > > > same flight)
> > > >
> > > > > What other features, or improvements, could/should be considered? Any
> > > > > strong opinions against the ideas above? (Remember, that a goal of mine
> > > > > is to be able to send a RecordBatch of rows that were modified
> > > > > intersected only by the field-nodes that have changed (including those
> > > > > with only inner node changes); thus the columns are a subset of the full
> > > > > schema and that the length of each node is independent of the other).
> > > >
> > > > > On Fri, Jul 9, 2021 at 9:26 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > It sounds like we may want to discuss some potential evolutions of the
> > > > > > Arrow binary protocol (for example: new Message types). Certainly a
> > > > > > can of worms but rather than trying to bolt some new functionality
> > > > > > onto the existing structures, it might be better to support the new
> > > > > > use cases through some new structures which will be more clear cut
> > > > > > from a forward compatibility standpoint.
> > > >
> > > > Nate
> > > >
> > > > --
> > > >
> >
>

