Hi,

I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/ where
Wes writes

> "string or binary data would come with additional overhead while pandas
> continues to use Python objects in its memory representation"


Pandas 1.0 introduced StringDtype, which I thought could help with this issue
(I didn't check the internals; I assume it still uses Python objects rather
than a NumPy-native representation, but I had nothing to lose by trying).
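
For reference, this is the dtype I mean (pandas >= 1.0); if I understand
correctly it's still backed by Python str objects internally:

    import pandas as pd

    # The new opt-in extension dtype for text data
    s = pd.Series(["aaaaa", "bbbbb"], dtype="string")
    print(s.dtype)  # string (pd.StringDtype)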

My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
"bbbbb"] * 100000000) and call .to_pandas(), the dtype of the result is
still "object". I tried to pass a types_mapper function (the docs aren't
really helpful, so I simply created def mapper(t): return pd.StringDtype),
but it didn't work.
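
For what it's worth, this is roughly what I expected to work; I'm not sure
whether StringDtype already implements the __from_arrow__ protocol in my
pandas version, so treat it as a sketch:

    import pyarrow as pa
    import pandas as pd

    def mapper(pa_type):
        # Return a dtype *instance* (not the class) for string columns,
        # and None to fall back to the default conversion.
        if pa_type == pa.string():
            return pd.StringDtype()
        return None

    table = pa.table({"col": ["aaaaa", "bbbbb"] * 1000})
    df = table.to_pandas(types_mapper=mapper)
    print(df["col"].dtype)  # hoping for "string" instead of "object"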

Is this a planned feature? Would it help performance at all? For now I'm
happy to use category/dictionary data, as the column is low cardinality and
it makes the conversion about 5x faster, but I was hoping for a simpler
solution.

I don't know the internals, but if "aaaaa" and "bbbbb" are immutable
strings, it shouldn't really differ from using the Categorical type (even
if Python objects are created for them, two immutable objects would
suffice). Converting compressed Parquet -> PyArrow is fast (less than 10
seconds); it's the PyArrow -> pandas step that is slow, running for 7
minutes (so I think PyArrow already has a nice implementation on its side).
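
For completeness, this is the workaround I'm using now
(strings_to_categorical is a documented to_pandas option; dictionary-
encoding on the Arrow side first should be equivalent, if I read the docs
right):

    import pyarrow as pa

    a = pa.array(["aaaaa", "bbbbb"] * 1000)

    # Ask the converter to produce pandas.Categorical directly
    s = a.to_pandas(strings_to_categorical=True)
    print(s.dtype)  # category

    # Or dictionary-encode on the Arrow side first
    s2 = a.dictionary_encode().to_pandas()
    print(s2.dtype)  # category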

Best regards,
Adam Lippai
