AlenkaF commented on issue #33059: URL: https://github.com/apache/arrow/issues/33059#issuecomment-4388939904
I will close this issue as it has been fixed by https://github.com/apache/arrow/pull/14106.

As for the promotion of integer index types for the dictionary type, there is a related comment in the docstrings: https://github.com/apache/arrow/blob/23cd1ff8f4e33b3207875e3395d2d6b1aeb1edc2/python/pyarrow/array.pxi#L188-L193 but I am not sure it covers `uint` being promoted to `int` of the same size, as this change seems to happen even when it is not necessary.

I asked Copilot to help me dig through the code. It seems this is expected on the C++ side, see: https://github.com/apache/arrow/blob/61c96ca0612ae46ef05becfeb5f987197180cb2e/cpp/src/arrow/array/builder_dict.h#L671-L674

`DictionaryBuilder` uses `AdaptiveIntBuilder` to create the indices and does not use `AdaptiveUIntBuilder`.

Looking at the format docs, I also found:

> Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices.

here: https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout

A separate issue can be opened if this design decision needs to be discussed further.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
