Re: [Format][C++] Offering limited support for unsigned dictionary indices

Wes McKinney Sun, 28 Jun 2020 08:24:05 -0700

I opened a PR

https://github.com/apache/arrow/pull/7567


We should probably vote on this so I will wait another day or so
before opening a vote so we can get this change through for the
release

On Sun, Jun 28, 2020 at 9:38 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I'll propose a patch to Columnar.rst about this. I think it's OK to
> allow uint64 indices (so we don't have to revisit the issue in a year
> or two or...) but the specification should say that they are not
> preferred.
>
> On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor <ptaylor.apa...@gmail.com> wrote:
> >
> > Responding to this comment from GitHub[1]:
> >
> > > If we had to make a bet about what % of dictionaries empirically are
> > > between 128 and 255 elements, I would bet that the percentage is
> > > small. If it turned out that 40% of dictionaries fell in that range
> > > then I would agree that this makes sense.
> >
> > I agree, the case where you'd use uint8 vs. int16 isn't very motivating,
> > since those are probably small tables or temporary allocations inside
> > larger operations.
> >
> > The case where you'd use uint16 vs. int32 is more motivating, since
> > dictionaries between 32767 and 65535 elements could easily back larger
> > columns and saving up to 1GiB on the keys can quickly add up.
> >
> > > I would recommend that the specification advise
> > > against use of 64-bit indices at all unless that are actually needed
> > > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > > UINT32_MAX elements
> >
> > Agreed.
> >
> > > but doesn't strike me as being a common occurrence
> >
> > This is somewhat common in certain graph operations on large datasets,
> > but I concede I may not be a representative sample of Arrow users :)
> >
> > Paul
> >
> > 1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352
> >
> > On 6/26/20 1:53 PM, Wes McKinney wrote:
> > > I think that situations where you need uint64 indices are likely to be
> > > exceedingly esoteric. I would recommend that the specification advise
> > > against use of 64-bit indices at all unless that are actually needed
> > > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > > UINT32_MAX elements, but doesn't strike me as being a common
> > > occurrence)
> > >
> > > IIUC, the only change that would be necessary would be a revision to
> > > Columnar.rst and perhaps some language augmentation in Schema.fbs, and
> > > we might want to add some integration tests for unsigned indices to
> > > probe whether each language supports them. The changes needed to
> > > support unsigned integers in C++ are probably not that invasive but I
> > > haven't taken a close look at it yet
> > >
> > > On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <ptaylor.apa...@gmail.com> 
> > > wrote:
> > >> If positive integers are expected, I'm in favor of supporting unsigned
> > >> index types. I was surprised at Arrow C++ restriction on signed indices
> > >> in the RAPIDS thread, perhaps it's newer than when I ported the logic in 
> > >> JS.
> > >>
> > >> Based on the flatbuffer schemas, dictionary indices could technically be
> > >> any Arrow type, which I assumed was to allow for more
> > >> complex/exotic/custom indexing schemes in the future. JS will allow you
> > >> to specify _any_ Arrow type as the dictionary codes, though using a
> > >> non-numeric type without a custom enumerator is UB.
> > >>
> > >> I'm also curious about how the restriction on dictionary index types
> > >> interacts with streaming delta dictionaries. In theory, you could have a
> > >> streaming data source produce enough delta dictionaries such that the
> > >> total dictionary size grows beyond 2^31-1 elements.
> > >>
> > >> I think that's a valid use-case of delta dictionaries assuming Arrow
> > >> aggregates the dictionaries into multiple RecordBatches (or a
> > >> ChunkedArray), which is what JS does. But if that were allowed, we would
> > >> have to allow 64-bit (signed or unsigned) dictionary index types.
> > >>
> > >> Paul
> > >>
> > >>
> > >> On 6/26/20 5:58 AM, Wes McKinney wrote:
> > >>> hi folks,
> > >>>
> > >>> At the moment, using unsigned integers for dictionary indices/codes
> > >>> isn't exactly forbidden by the metadata [1], which says that the
> > >>> indices must be "positive integers". Meanwhile, the columnar format
> > >>> specification says
> > >>>
> > >>> "When a field is dictionary encoded, the values are represented by an
> > >>> array of signed integers representing the index of the value in the
> > >>> dictionary. The memory layout for a dictionary-encoded array is the
> > >>> same as that of a primitive signed integer layout."
> > >>>
> > >>> I was looking at a discussion in RAPIDS about this topic [2]
> > >>>
> > >>> When we drafted the columnar specification for this, the intention as
> > >>> I recall was only to support signed integer indices. The rationale
> > >>> basically is that:
> > >>>
> > >>> * Better cross platform / language support for signed (e.g. certainly
> > >>> in the JVM)
> > >>> * Supporting 4 instead of 8 index types is less burdensome for the
> > >>> developer and means less code to generate to support them
> > >>> * Unsigned wraparound bugs could pass silently
> > >>>
> > >>> I think it would be feasible to support the unsigned indices with the
> > >>> following caveats:
> > >>>
> > >>> * Signed integers are recommended as the "compatible" and preferred 
> > >>> choice
> > >>> * Most algorithms in the reference libraries should choose signed over
> > >>> unsigned when generating indices
> > >>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> > >>> they don't support unsigned well
> > >>>
> > >>> I can't say I'm thrilled about having to maintain extra code for the
> > >>> unsigned case, but it also seems like it would not do great harm
> > >>> overall. Additionally, if you are certain that the indices are all
> > >>> non-negative, then you can use the same code to process both intX and
> > >>> uintX -- we use this trick in the Take implementation in C++ to
> > >>> generate half as much binary code.
> > >>>
> > >>> Thoughts?
> > >>>
> > >>> Thanks
> > >>> Wes
> > >>>
> > >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> > >>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Re: [Format][C++] Offering limited support for unsigned dictionary indices

Reply via email to