jorisvandenbossche commented on issue #34583: URL: https://github.com/apache/arrow/issues/34583#issuecomment-1472056816
Thanks for the reproducer! I can't test this directly right now (it's too big for my laptop's memory), but my first guess at what is going on: the DictionaryArray itself has a dictionary of `string` type, while the decoded array should use `large_string`, because the fully decoded data is too big to fit into a normal string array (which uses int32 offsets). And from the output you show, it seems to be producing a normal StringArray.

If that guess is correct, a possible workaround for now is to first chunk the dictionary array `postcode_dict` before decoding it:

```
postcode_dict_chunked = pa.chunked_array(
    [postcode_dict.slice(i, 100_000) for i in range(0, len(postcode_dict), 100_000)]
)
```

Now, we don't have a "decode_dictionary" method on a ChunkedArray, so we can do this manually with a `take`:

```
postcode_indices_chunked = pa.chunked_array(
    [chunk.indices for chunk in postcode_dict_chunked.chunks]
)
postcode_decoded = postcode_dict.dictionary.take(postcode_indices_chunked)
```
