jorisvandenbossche commented on issue #44048: URL: https://github.com/apache/arrow/issues/44048#issuecomment-2343026089
> The failure happens when the total number of characters reaches the size of an unsigned 32bit integer (`np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647`), indicating it may be an int32 Overflow issue.
> ..
> It seems from_dataframe avoids the error by leveraging a 'large_string' datatype

It's indeed related to that. A single Array with the `string` type can only hold a limited total number of characters across all elements combined, because it uses int32 offsets. The `large_string` type, on the other hand, uses int64 offsets ([spec](https://arrow.apache.org/docs/dev/format/Columnar.html#variable-size-binary-layout)).

The problem here is that when we convert the pandas Categorical column, we convert the integer codes and the actual categories (the unique values) separately to a pyarrow array. And when converting the categories, we bump into the issue that they do not fit into a single array. At that point the `pa.array(..)` function automatically falls back to returning a ChunkedArray:

```python
# using your above df
>>> values = df["float_gran"].array
>>> pa.array(values.categories.values)
<pyarrow.lib.ChunkedArray object at 0x7f607b87c520>
[
  [
    "0.00010000548144684096",
    "0.00010002117808627364",
    ...
    "0.9792001085756353",
    "0.9792001280159454"
  ],
  [
    "0.9792001297798442",
    "0.9792001326304284",
    ...
    "9.997302630371241e-05",
    "9.999832524965058e-05"
  ]
]
```

But what then causes the error is that we try to create the DictionaryArray using `from_arrays`, so simplified something like:

```python
indices = pa.array(values.codes)
dictionary = pa.array(values.categories)
result = pa.DictionaryArray.from_arrays(indices, dictionary)
```

and this method cannot handle ChunkedArray input; it expects two Arrays. This is a problem in our implementation, though, and something we should fix.

---

What you can do in the short term:

- Don't use a categorical (dictionary type) in this case. Of course, I suppose this is a made-up example to illustrate the issue, and the real-world use case might have a good reason to use one, but in general the categorical type is mostly useful if you have repeated values (i.e. where the unique categories form a smaller array).
- Specify that you want `large_string` for the resulting pyarrow dictionary type. This can be done by specifying a schema, although that is a bit inconvenient (see the sketches after this list). And you are correct that right now this doesn't get preserved in a roundtrip (this will get solved in pandas 3.0, though, because then pandas will start using large_string by default on their side as well).
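To make the second option concrete, here is a minimal sketch of passing an explicit schema to `pa.Table.from_pandas`, using a small made-up stand-in for your dataframe (only the column name `float_gran` is taken from your example):

```python
import pandas as pd
import pyarrow as pa

# small stand-in for the real df from the issue
df = pd.DataFrame({"float_gran": pd.Categorical(["0.1", "0.2", "0.3"])})

# request int64 offsets for the dictionary values via an explicit schema
schema = pa.schema([("float_gran", pa.dictionary(pa.int32(), pa.large_string()))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

print(table.schema.field("float_gran").type)
# dictionary<values=large_string, indices=int32, ordered=0>
```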
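Alternatively, mirroring the simplified `from_arrays` snippet above, a sketch of doing the conversion manually: requesting `large_string` when converting the categories keeps them in a single Array (with int64 offsets), so `DictionaryArray.from_arrays` gets the two Arrays it expects. This assumes `values` is the Categorical's `.array` as in that snippet:

```python
# values = df["float_gran"].array, as above
indices = pa.array(values.codes)
# with an explicit large_string type, pa.array does not fall back to a
# ChunkedArray, because int64 offsets can address all characters combined
dictionary = pa.array(values.categories, type=pa.large_string())
result = pa.DictionaryArray.from_arrays(indices, dictionary)
```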
