jorisvandenbossche commented on issue #44048: URL: https://github.com/apache/arrow/issues/44048#issuecomment-2343026089
> The failure happens when the total number of characters reaches the size of an unsigned 32bit integer (`np.sum(df["float_gran"].cat.categories.str.len()) > 2_147_483_647`), indicating it may be an int32 Overflow issue.
> ..
> It seems from_dataframe avoids the error by leveraging a 'large_string' datatype

It's indeed related to that. A single Array with the `string` type can only hold a limited total number of characters across all elements combined, because it uses int32 offsets. The `large_string` type, on the other hand, uses int64 offsets ([spec](https://arrow.apache.org/docs/dev/format/Columnar.html#variable-size-binary-layout)).

The problem here is that when we convert the pandas Categorical column, we convert the integer codes and the actual categories (the unique values) separately to a pyarrow array. And when converting the categories, we bump into the issue that they do not fit into a single array. At that point the `pa.array(..)` function automatically falls back to returning a ChunkedArray:

```python
# using your above df
>>> values = df["float_gran"].array
>>> pa.array(values.categories.values)
<pyarrow.lib.ChunkedArray object at 0x7f607b87c520>
[
  [
    "0.00010000548144684096",
    "0.00010002117808627364",
    ...
    "0.9792001085756353",
    "0.9792001280159454"
  ],
  [
    "0.9792001297798442",
    "0.9792001326304284",
    ...
    "9.997302630371241e-05",
    "9.999832524965058e-05"
  ]
]
```

But what then causes the error is that we try to create the DictionaryArray using `from_arrays`, so simplified something like:

```python
indices = pa.array(values.codes)
dictionary = pa.array(values.categories)
result = pa.DictionaryArray.from_arrays(indices, dictionary)
```

and this method cannot handle ChunkedArray input; it expects two Arrays. This is a problem in our implementation, though, and something we should fix.

---

What you can do in the short term:

- Don't use a categorical (dictionary type) in this case. Of course, I suppose this is a made-up example to illustrate the issue, and the real-world use case might have a good reason to use one, but in general the categorical type is mostly useful if you have repeated values (i.e. where the unique categories form a smaller array).
- Specify that you want `large_string` for the resulting pyarrow dictionary type. This can be done by specifying a schema, although that is a bit inconvenient (see the sketches after this list). And you are correct that right now this doesn't get preserved in a roundtrip (this will get solved in pandas 3.0, though, because then pandas will start using large_string by default on their side as well).
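To make the second option concrete, here is a minimal sketch of passing an explicit schema to `pa.Table.from_pandas`, using a small made-up stand-in for your dataframe (only the column name `float_gran` is taken from your example):

```python
import pandas as pd
import pyarrow as pa

# small stand-in for the real df from the issue
df = pd.DataFrame({"float_gran": pd.Categorical(["0.1", "0.2", "0.3"])})

# request int64 offsets for the dictionary values via an explicit schema
schema = pa.schema([("float_gran", pa.dictionary(pa.int32(), pa.large_string()))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

print(table.schema.field("float_gran").type)
# dictionary<values=large_string, indices=int32, ordered=0>
```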
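Alternatively, mirroring the simplified `from_arrays` snippet above, a sketch of doing the conversion manually: requesting `large_string` when converting the categories keeps them in a single Array (with int64 offsets), so `DictionaryArray.from_arrays` gets the two Arrays it expects. This assumes `values` is the Categorical's `.array` as in that snippet:

```python
# values = df["float_gran"].array, as above
indices = pa.array(values.codes)
# with an explicit large_string type, pa.array does not fall back to a
# ChunkedArray, because int64 offsets can address all characters combined
dictionary = pa.array(values.categories, type=pa.large_string())
result = pa.DictionaryArray.from_arrays(indices, dictionary)
```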
