I'll propose a patch to Columnar.rst about this. I think it's OK to
allow uint64 indices (so we don't have to revisit the issue in a year
or two or...) but the specification should say that they are not
preferred.

On Fri, Jun 26, 2020 at 7:20 PM Paul Taylor <ptaylor.apa...@gmail.com> wrote:
>
> Responding to this comment from GitHub[1]:
>
> > If we had to make a bet about what % of dictionaries empirically are
> > between 128 and 255 elements, I would bet that the percentage is
> > small. If it turned out that 40% of dictionaries fell in that range
> > then I would agree that this makes sense.
>
> I agree, the case where you'd use uint8 vs. int16 isn't very motivating,
> since those are probably small tables or temporary allocations inside
> larger operations.
>
> The case where you'd use uint16 vs. int32 is more motivating, since
> dictionaries between 32767 and 65535 elements could easily back larger
> columns and saving up to 1GiB on the keys can quickly add up.
>
> > I would recommend that the specification advise
> > against use of 64-bit indices at all unless that are actually needed
> > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > UINT32_MAX elements
>
> Agreed.
>
> > but doesn't strike me as being a common occurrence
>
> This is somewhat common in certain graph operations on large datasets,
> but I concede I may not be a representative sample of Arrow users :)
>
> Paul
>
> 1. https://github.com/rapidsai/cudf/pull/5501#issuecomment-649936352
>
> On 6/26/20 1:53 PM, Wes McKinney wrote:
> > I think that situations where you need uint64 indices are likely to be
> > exceedingly esoteric. I would recommend that the specification advise
> > against use of 64-bit indices at all unless that are actually needed
> > to represent the data (i.e. dictionaries have more than INT32_MAX /
> > UINT32_MAX elements, but doesn't strike me as being a common
> > occurrence)
> >
> > IIUC, the only change that would be necessary would be a revision to
> > Columnar.rst and perhaps some language augmentation in Schema.fbs, and
> > we might want to add some integration tests for unsigned indices to
> > probe whether each language supports them. The changes needed to
> > support unsigned integers in C++ are probably not that invasive but I
> > haven't taken a close look at it yet
> >
> > On Fri, Jun 26, 2020 at 3:23 PM Paul Taylor <ptaylor.apa...@gmail.com> 
> > wrote:
> >> If positive integers are expected, I'm in favor of supporting unsigned
> >> index types. I was surprised at Arrow C++ restriction on signed indices
> >> in the RAPIDS thread, perhaps it's newer than when I ported the logic in 
> >> JS.
> >>
> >> Based on the flatbuffer schemas, dictionary indices could technically be
> >> any Arrow type, which I assumed was to allow for more
> >> complex/exotic/custom indexing schemes in the future. JS will allow you
> >> to specify _any_ Arrow type as the dictionary codes, though using a
> >> non-numeric type without a custom enumerator is UB.
> >>
> >> I'm also curious about how the restriction on dictionary index types
> >> interacts with streaming delta dictionaries. In theory, you could have a
> >> streaming data source produce enough delta dictionaries such that the
> >> total dictionary size grows beyond 2^31-1 elements.
> >>
> >> I think that's a valid use-case of delta dictionaries assuming Arrow
> >> aggregates the dictionaries into multiple RecordBatches (or a
> >> ChunkedArray), which is what JS does. But if that were allowed, we would
> >> have to allow 64-bit (signed or unsigned) dictionary index types.
> >>
> >> Paul
> >>
> >>
> >> On 6/26/20 5:58 AM, Wes McKinney wrote:
> >>> hi folks,
> >>>
> >>> At the moment, using unsigned integers for dictionary indices/codes
> >>> isn't exactly forbidden by the metadata [1], which says that the
> >>> indices must be "positive integers". Meanwhile, the columnar format
> >>> specification says
> >>>
> >>> "When a field is dictionary encoded, the values are represented by an
> >>> array of signed integers representing the index of the value in the
> >>> dictionary. The memory layout for a dictionary-encoded array is the
> >>> same as that of a primitive signed integer layout."
> >>>
> >>> I was looking at a discussion in RAPIDS about this topic [2]
> >>>
> >>> When we drafted the columnar specification for this, the intention as
> >>> I recall was only to support signed integer indices. The rationale
> >>> basically is that:
> >>>
> >>> * Better cross platform / language support for signed (e.g. certainly
> >>> in the JVM)
> >>> * Supporting 4 instead of 8 index types is less burdensome for the
> >>> developer and means less code to generate to support them
> >>> * Unsigned wraparound bugs could pass silently
> >>>
> >>> I think it would be feasible to support the unsigned indices with the
> >>> following caveats:
> >>>
> >>> * Signed integers are recommended as the "compatible" and preferred choice
> >>> * Most algorithms in the reference libraries should choose signed over
> >>> unsigned when generating indices
> >>> * Libraries may choose to promote unsigned to signed (e.g. in Java) if
> >>> they don't support unsigned well
> >>>
> >>> I can't say I'm thrilled about having to maintain extra code for the
> >>> unsigned case, but it also seems like it would not do great harm
> >>> overall. Additionally, if you are certain that the indices are all
> >>> non-negative, then you can use the same code to process both intX and
> >>> uintX -- we use this trick in the Take implementation in C++ to
> >>> generate half as much binary code.
> >>>
> >>> Thoughts?
> >>>
> >>> Thanks
> >>> Wes
> >>>
> >>> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280
> >>> [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509

Reply via email to