Thanks for the detailed answer.
It's indeed 5-10% faster with the correct arguments you provided, but the
performance is still far from that of the categorical-type-based solution.
I'll track the linked pandas issue. I'm not a C++ dev, but I'll be happy to
test, benchmark or add docs.

Best regards,
Adam Lippai

On Thu, Jun 18, 2020 at 10:08 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Adam,
>
> On Wed, 17 Jun 2020 at 13:07, Adam Lippai <a...@rigo.sk> wrote:
>
> > Hi,
> >
> > I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/
> > where
> > Wes writes
> >
> > > "string or binary data would come with additional overhead while pandas
> > > continues to use Python objects in its memory representation"
> >
> >
> > Pandas 1.0 introduced StringDtype, which I thought could help with the
> > issue (I didn't check the internals; I assume they still use Python
> > objects, just not NumPy, but I had nothing to lose).
> >
> > My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
> > "bbbbb"]*100000000) and call .to_pandas(), the dtype of the dataframe is
> > still "object". I tried to add a types_mapper function (the docs are not
> > really helpful, so I simply created def mapper(t): return pd.StringDtype)
> > but it didn't work.
> >
>
> Two caveats here: 1) the function needs to return an *instance* and not a
> class (so `return pd.StringDtype()`), and 2) this keyword currently only
> works for Table.to_pandas (this is certainly something that should either
> be fixed or clarified in the docs).
>
> So taking your example array, and putting it in a Table, and then
> converting to pandas, the types_mapper keyword works:
>
> >>> table = pa.table({'a': a})
> >>> df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
> >>> df.dtypes
> a    string
> dtype: object
>
> Now, the pandas string dtype currently still uses Python objects to
> store the strings (similar to using an object dtype). There are plans
> to store the strings more efficiently (eg using arrow's string array memory
> layout), see https://github.com/pandas-dev/pandas/issues/8640/.
>
> So right now, if you have many repeated strings, I would still go for
> the category/dictionary type, as that will be a lot more efficient for
> further processing in pandas.
>
>
>
> >
> > Is this a future feature? Would it help anything? For now I'm happy to
> > use category/dictionary data, as the column is low cardinality and it
> > makes it 5x faster, but I was hoping for a simpler solution. I don't
> > know the internals, but if "aaaaa" and "bbbbb" are immutable strings it
> > shouldn't really differ from using the Category type (even if it's
> > creating Python objects for them, as it can be done with 2 immutable
> > objects). Converting compressed parquet -> pyarrow is fast (less than
> > 10 seconds); it's pyarrow -> pandas which is slow, running for 7
> > minutes (so I think pyarrow already has a nice implementation).
> >
>
> There is a `deduplicate_objects` keyword in to_pandas for exactly this
> (to avoid creating multiple Python objects for identical strings).
> However, as indicated above, and depending on what your further processing
> steps are in pandas, using a categorical/dictionary type might still be
> the better option.
>
> Joris
>
>
> >
> > Best regards,
> > Adam Lippai
> >
>
