Hi Wes,

I agree with you that this is a data representation issue.

My point is that data representation and data operations are closely
related.
As far as this issue is concerned, if we allow several slots in the union
vector to map to the same value in the underlying child vector, then
modifying one slot through the union vector may also modify the others,
which is unexpected.

This is a problem with our current specification, because our
vectors/arrays provide set/write APIs.
So we may need a "coherency protocol" to define the behavior (e.g. copy on
write) when trying to modify a shared value, IMO.
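
To make the concern concrete, here is a minimal sketch in plain Python. It
is a toy model of a dense union, not the Arrow API, and every function name
in it (get, set_in_place, set_copy_on_write, decode) is hypothetical. It uses
the first representation from Wes's example quoted below, and the
copy-on-write variant is only one possible behavior, shown as an assumption:

```python
# Plain-Python model of a dense union; this is NOT the Arrow API.
# All function names are hypothetical and exist only for illustration.

def get(type_ids, offsets, children, i):
    """Random access: the same lookup whatever the order of the offsets."""
    return children[type_ids[i]][offsets[i]]

def decode(type_ids, offsets, children):
    """Materialize the logical union values as a Python list."""
    return [get(type_ids, offsets, children, i) for i in range(len(type_ids))]

def set_in_place(type_ids, offsets, children, i, value):
    """Naive write: mutates the child slot, which other slots may share."""
    children[type_ids[i]][offsets[i]] = value

def set_copy_on_write(type_ids, offsets, children, i, value):
    """Write into a fresh child slot and repoint only slot i.

    For simplicity this always appends; a real protocol would probably
    first check whether the old child slot is actually shared.
    """
    child = children[type_ids[i]]
    child.append(value)
    offsets[i] = len(child) - 1

# First representation of ['a', 'a', 'a', 0, 1, 'b', 'b'] with repeated offsets.
type_ids = [0, 0, 0, 1, 1, 0, 0]
offsets = [0, 0, 0, 0, 1, 1, 1]
children = [['a', 'b'], [0, 1]]
before = decode(type_ids, offsets, children)

# In-place write to slot 0 leaks into slots 1 and 2, which share child[0][0].
set_in_place(type_ids, offsets, children, 0, 'z')
after_in_place = decode(type_ids, offsets, children)

# Same starting data, but with the copy-on-write variant only slot 0 changes.
cow_offsets = [0, 0, 0, 0, 1, 1, 1]
cow_children = [['a', 'b'], [0, 1]]
set_copy_on_write(type_ids, cow_offsets, cow_children, 0, 'z')
after_cow = decode(type_ids, cow_offsets, cow_children)

print(before)          # ['a', 'a', 'a', 0, 1, 'b', 'b']
print(after_in_place)  # ['z', 'z', 'z', 0, 1, 'b', 'b']
print(after_cow)       # ['z', 'a', 'a', 0, 1, 'b', 'b']
```

Reading is unaffected by repeated offsets, which I think is Wes's point; the
surprise only appears once a write goes through one of the shared slots.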

Best,
Liya Fan

On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Liya,
>
> I don't understand your point -- we are strictly discussing data
> representation here I believe. From a data representation perspective,
> there is no conflict with repeated or non-monotonic offset values.
>
> On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <liya.fa...@gmail.com> wrote:
> >
> > This is an interesting question.
> > IMO, to support repeated values, we also need to design a "coherency
> > protocol", to avoid the scenario where once a value is written, the change
> > is propagated to another slot unexpectedly.
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > > Hmm, I also thought the intention was monotonically increasing. I can't
> > > > think of a strong reason one way or another. If the argument about
> > > > code to do random access is the same in all cases, is there any
> > > > benefit to forcing any order at all?  Memory prefetching?
> > >
> > > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <wesmck...@gmail.com>
> > > > wrote:
> > >
> > > > hi Antoine,
> > > >
> > > > It's a good question.
> > > >
> > > > The intent when we wrote the specification was to be strictly
> > > > monotonic, but there seems nothing especially harmful about relaxing
> > > > the constraint to allow for repeated values or even non-monotonicity
> > > > (strict or otherwise). For example, if we had the union
> > > >
> > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > >
> > > > then this could be represented as
> > > >
> > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > child[0]: ['a', 'b']
> > > > child[1]: [0, 1]
> > > >
> > > > or
> > > >
> > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > child[0]: ['b', 'a']
> > > > child[1]: [0, 1]
> > > >
> > > > What do others think? Either way some clarification in the
> > > > specification would be useful. Because the code used to do random
> > > > access is the same in all cases, I feel weakly supportive of removing
> > > > constraints on the offsets.
> > > >
> > > > - Wes
> > > >
> > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <anto...@python.org>
> > > > > wrote:
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > > I'd like some clarification on the spec and intent for dense arrays.
> > > > >
> > > > > > Currently, it is specified that offsets of a dense union are "in
> > > > > > order / increasing" (*).  However, it is not obvious whether
> > > > > > repeated values are allowed or not.
> > > > >
> > > > > > I suspect the intent is to avoid having people exploit unions as
> > > > > > some kind of poor man's dictionaries.  Also, perhaps some
> > > > > > optimizations are possible if monotonic or strictly monotonic
> > > > > > indices are assumed?  But I don't know the history behind the
> > > > > > union type.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > > >
> > >
>
