I'm going to start a vote about this tomorrow if there are not more
comments about this.

On Thu, Jun 25, 2020 at 5:41 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I updated the PR to fix some issues with my edits that Antoine pointed
> out. I can start working on a C++ patch to implement the C++ changes
> in the next few days if that helps. Given the time urgency of deciding
> what to do on this if anyone else could express opinions it would be
> helpful.
>
> I see one of the major benefits of this that when forming a sparse
> union (the one without the offsets buffer) the child bitmaps do not
> have to be spliced together to create the top-level bitmap with the
> assistance of the type ids buffer, which when you think about it is a
> rather complicated operation.
>
> On Wed, Jun 24, 2020 at 5:34 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > I drafted the specification changes that would be associated with the
> > union changes
> >
> > https://github.com/apache/arrow/pull/7535
> >
> > I'll start a separate discussion about incrementing the
> > MetadataVersion since that must be discussed independently.
> >
> > Please take a look
> >
> > On Wed, Jun 24, 2020 at 3:50 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > I should also add that we could (with some effort) use the
> > > MetadataVersion V4/V5 indicator to offer backward compatibility for
> > > old serialized union data
> > >
> > > In any case, if there is consensus about this, we would need to have a
> > > vote and get busy with implementing and testing the changes. I could
> > > assist with the C++ changes
> > >
> > > On Wed, Jun 24, 2020 at 1:14 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
> > > > <fsaintjacq...@gmail.com> wrote:
> > > > >
> > > > > OTOH,
> > > > >
> > > > > how do we handle NullType -> UnionType<T...> cast conversion? Do we
> > > > > require some convention like the first children ArrayData null bitmap
> > > > > to be set and all tags set to 0?
> > > >
> > > > Sure, that sounds like a reasonable implementation should this
> > > > operation actually be required.
> > > >
> > > > > François
> > > > >
> > > > > On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <anto...@python.org> 
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > > > > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou 
> > > > > > > <anto...@python.org> wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > > > > > >>> hi folks,
> > > > > > >>>
> > > > > > >>> As discussed on the recent GitHub PR [1], as a means of 
> > > > > > >>> reconciling
> > > > > > >>> the long-standing cross-implementation incompatibilities with 
> > > > > > >>> Union
> > > > > > >>> types, it's been proposed to remove the top-level validity 
> > > > > > >>> bitmap from
> > > > > > >>> the Union data layout and let validity be determined 
> > > > > > >>> exclusively by
> > > > > > >>> the child arrays of the union. So the only additional data 
> > > > > > >>> needed to
> > > > > > >>> form a union are the type ids (and for the dense union, the 
> > > > > > >>> offsets).
> > > > > > >>>
> > > > > > >>> I do not think this change meaningfully alters the semantics of 
> > > > > > >>> Union
> > > > > > >>> types and I think it also simplifies their construction, so I 
> > > > > > >>> would be
> > > > > > >>> in favor of making it for 1.0.0.
> > > > > > >>
> > > > > > >> So it sounds like this may break compatibility with existing 
> > > > > > >> only uses
> > > > > > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, 
> > > > > > >> Red
> > > > > > >> Arrow); not only on the API side, but on the data side.
> > > > > > >
> > > > > > > Right. However, I don't think these changes will be very 
> > > > > > > disruptive,
> > > > > > > and we always knew that this disruption was possible because of 
> > > > > > > the
> > > > > > > hitherto unreconciled issues with Unions. The applications that 
> > > > > > > I'm
> > > > > > > aware of that use Union serialization (e.g. Ray) use it only for
> > > > > > > ephemeral serialization.
> > > > > >
> > > > > > Ok, that's a convincing argument.
> > > > > >
> > > > > > > In general, I think that we should be bumping the metadata 
> > > > > > > version [1]
> > > > > > > for 1.0.0 to create a forcing function for upgrade to the
> > > > > > > format-stable line of libraries. The C++/Python libraries could 
> > > > > > > have a
> > > > > > > "compatibility mode" (like the "write_legacy_ipc_format" options) 
> > > > > > > that
> > > > > > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain 
> > > > > > > features
> > > > > > > (like unions -- which are not needed for Spark for example) 
> > > > > > > disabled.
> > > > > >
> > > > > > Hmm, I hope we can keep the negotiation minimal.  We should take 
> > > > > > from
> > > > > > the Jon Postel principle - be liberal in what you accept, strict in 
> > > > > > what
> > > > > > you emit.
> > > > > >
> > > > > > So the IPC reader can have a simple detection that goes this way:
> > > > > >
> > > > > >   * if we receive 1 buffer for sparse union or 2 buffers for dense 
> > > > > > union
> > > > > > => it's the new-style format, there's nothing to do
> > > > > >
> > > > > >   * if we receive 2 (non-null) buffers for sparse union or 3 
> > > > > > (non-null)
> > > > > > buffers for dense union
> > > > > > => it's the old format, we should AND the parent bitmap into each 
> > > > > > of the
> > > > > > child bitmaps
> > > > > >
> > > > > > We can also add a flag to IpcOptions to enable/disable 
> > > > > > compatibility tricks.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.

Reply via email to