I think in the interest of not having the spec fork we should probably do this. It is partially our fault for not providing better documentation in Schema.fbs (and potentially more thorough integration tests).
Maybe we should explicitly disallow uint64 which provides the biggest headache for the JVM and it is probably used rarely if at all for dictionaries. It is possible even uint32 might be unnecessary? -Micah On Fri, Jun 26, 2020 at 5:58 AM Wes McKinney <wesmck...@gmail.com> wrote: > hi folks, > > At the moment, using unsigned integers for dictionary indices/codes > isn't exactly forbidden by the metadata [1], which says that the > indices must be "positive integers". Meanwhile, the columnar format > specification says > > "When a field is dictionary encoded, the values are represented by an > array of signed integers representing the index of the value in the > dictionary. The memory layout for a dictionary-encoded array is the > same as that of a primitive signed integer layout." > > I was looking at a discussion in RAPIDS about this topic [2] > > When we drafted the columnar specification for this, the intention as > I recall was only to support signed integer indices. The rationale > basically is that: > > * Better cross platform / language support for signed (e.g. certainly > in the JVM) > * Supporting 4 instead of 8 index types is less burdensome for the > developer and means less code to generate to support them > * Unsigned wraparound bugs could pass silently > > I think it would be feasible to support the unsigned indices with the > following caveats: > > * Signed integers are recommended as the "compatible" and preferred choice > * Most algorithms in the reference libraries should choose signed over > unsigned when generating indices > * Libraries may choose to promote unsigned to signed (e.g. in Java) if > they don't support unsigned well > > I can't say I'm thrilled about having to maintain extra code for the > unsigned case, but it also seems like it would not do great harm > overall. Additionally, if you are certain that the indices are all > non-negative, then you can use the same code to process both intX and > uintX -- we use this trick in the Take implementation in C++ to > generate half as much binary code. > > Thoughts? > > Thanks > Wes > > [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280 > [2]: https://github.com/rapidsai/cudf/pull/5501#issuecomment-649934509 >