[GitHub] [arrow] jorisvandenbossche commented on issue #37476: [Python] pyarrow.array silently rewrites dictionary types to use signed integers for indexes

via GitHub Thu, 31 Aug 2023 00:14:12 -0700


jorisvandenbossche commented on issue #37476:
URL: https://github.com/apache/arrow/issues/37476#issuecomment-1700486940


   Thanks for the report. I think the problem is that the Python-to-Arrow 
conversion (`python_to_arrow.cc`) uses a DictionaryBuilder under the hood, 
which is created with:
   
   
https://github.com/apache/arrow/blob/9b6be29f431705ce1f85cc218c66d4d03698f06b/cpp/src/arrow/builder.cc#L312-L320
   
   This passes `exact_index_type = False`, which essentially means that it 
will use an adaptive int builder (one that starts with the bit width you 
specified, but can still grow, e.g. from int32 to int64, if needed). 
   
   Maybe one way to fix the signed vs. unsigned change is to let it use an 
AdaptiveIntBuilder or an AdaptiveUIntBuilder, depending on the signedness of 
the original index type. That would preserve the signedness while keeping the 
ability to let the bit width grow if necessary to convert the data to a 
dictionary type.
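
To make the idea concrete, here is a rough pure-Python sketch of the selection rule described above: keep the signedness of the requested index type, and only widen the bit width when the indices no longer fit. It is a toy model, not Arrow code. The function name `adaptive_index_type` and the string-based type names are made up for illustration, and the real builders may differ in detail.

```python
# Toy model only: nothing here is part of the Arrow or pyarrow API.
# It illustrates "preserve signedness, allow the bit width to grow".

BIT_WIDTHS = (8, 16, 32, 64)


def adaptive_index_type(requested: str, max_index: int) -> str:
    """Pick an index type for a dictionary with largest index `max_index`.

    `requested` is a name such as "int32" or "uint8" (hypothetical
    string encoding of the index type the user asked for).
    The result has the same signedness as `requested`, is never
    narrower than it, and is the narrowest width that can hold
    `max_index`.
    """
    unsigned = requested.startswith("uint")
    prefix = "uint" if unsigned else "int"
    start_bits = int(requested[len(prefix):])

    for bits in BIT_WIDTHS:
        if bits < start_bits:
            continue  # never shrink below the requested width
        # Largest value representable at this width and signedness.
        limit = (2 ** bits - 1) if unsigned else (2 ** (bits - 1) - 1)
        if max_index <= limit:
            return f"{prefix}{bits}"

    raise OverflowError(f"index {max_index} does not fit in 64 bits")


# Requested unsigned stays unsigned; width grows only when needed.
print(adaptive_index_type("uint32", 10))
print(adaptive_index_type("uint8", 256))
print(adaptive_index_type("int8", 200))
```

With this rule a requested `uint32` index type would stay unsigned for small dictionaries, and a requested `uint8` would widen to `uint16` rather than switch to a signed type.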

