Thanks for the detailed answer. It's indeed 5-10% faster with the correct arguments you provided, but the performance is still far from that of the categorical-type-based solution. I'll track the linked pandas issue. I'm not a C++ dev, but I'll be happy to test, benchmark, or add docs.
Best regards,
Adam Lippai

On Thu, Jun 18, 2020 at 10:08 AM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> Hi Adam,
>
> On Wed, 17 Jun 2020 at 13:07, Adam Lippai <a...@rigo.sk> wrote:
>
> > Hi,
> >
> > I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/
> > where Wes writes
> >
> > > "string or binary data would come with additional overhead while pandas
> > > continues to use Python objects in its memory representation"
> >
> > Pandas 1.0 introduced StringDtype, which I thought could help with the
> > issue (I didn't check the internals, I assume they still use Python
> > objects, just not NumPy, but I had nothing to lose).
> >
> > My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
> > "bbbbb"]*100000000) and call .to_pandas(), the dtype of the dataframe is
> > still "object". I tried to add a types_mapper function (the docs are not
> > really helpful, so I simply created def mapper(t): return pd.StringDtype)
> > but it didn't work.
>
> Two caveats here: 1) the function needs to return an *instance* and not a
> class (so `return pd.StringDtype()`), and 2) this keyword only works for
> Table.to_pandas right now (this is certainly something that should either
> be fixed or clarified in the docs).
>
> So taking your example array, putting it in a Table, and then converting
> to pandas, the types_mapper keyword works:
>
> >>> table = pa.table({'a': a})
> >>> df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
> >>> df.dtypes
> a    string
> dtype: object
>
> Now, the pandas string dtype is currently still using Python objects to
> store the strings (so similarly as using an object dtype). There are plans
> to store the strings more efficiently (e.g. using Arrow's string array
> memory layout), see https://github.com/pandas-dev/pandas/issues/8640/.
> But so right now, if you have many repeated strings, I would still go for
> the category/dictionary type, as that will be a lot more efficient for
> further processing in pandas.
>
> > Is this a future feature? Would it help anything? For now I'm happy to
> > use category/dictionary data, as the column is low cardinality and it
> > makes it 5x faster, but I was hoping for a simpler solution. I don't
> > know the internals, but if "aaaaa" and "bbbbb" are immutable strings it
> > shouldn't really differ from using the Category type (even if it's
> > creating Python objects for them, as it can be done with 2 immutable
> > objects). Converting compressed Parquet -> PyArrow is fast (less than
> > 10 seconds); it's PyArrow -> pandas which is slow, running for 7
> > minutes (so I think PyArrow already has a nice implementation).
>
> There is a `deduplicate_objects` keyword in to_pandas exactly for this (to
> avoid creating multiple Python objects for identical strings).
> However, as indicated above, and depending on what your further processing
> steps are in pandas, using a categorical/dictionary type might still be
> the better option.
>
> Joris
>
> > Best regards,
> > Adam Lippai