jorisvandenbossche commented on issue #34583: URL: https://github.com/apache/arrow/issues/34583#issuecomment-1472056816
Thanks for the reproducer! I can't test this directly right now (it's too big for my laptop's memory), but my first guess at what is going on: the DictionaryArray itself has a dictionary of `string` type, while the decoded array should use `large_string`, because the fully decoded data is too big to fit into a normal string array (which uses int32 offsets). And from the output you show, it seems to be producing a normal StringArray.

If that guess is correct, a possible workaround for now is to first chunk the dictionary array `postcode_dict` before decoding it:

```
postcode_dict_chunked = pa.chunked_array(
    [postcode_dict.slice(i, 100_000) for i in range(0, len(postcode_dict), 100_000)]
)
```

Now, we don't have a "decode_dictionary" method on a ChunkedArray, so we can do this manually with a `take`:

```
postcode_indices_chunked = pa.chunked_array(
    [chunk.indices for chunk in postcode_dict_chunked.chunks]
)
postcode_decoded = postcode_dict.dictionary.take(postcode_indices_chunked)
```
