jorisvandenbossche commented on issue #42026: URL: https://github.com/apache/arrow/issues/42026#issuecomment-2154445411
Profiling both showed a clear difference: there is heavy hash-table usage in the `to_pandas()` case, which reminded me that we have a `deduplicate_objects` option. It is False by default (which is what `to_numpy` uses) but set to True by default in `to_pandas()`. That explains the difference, and in a case like this with all-unique binary values, it only adds unnecessary overhead. Disabling it gives the expected similar performance for both `to_numpy` and `to_pandas`:

```
In [4]: %timeit _ = arr.to_numpy(zero_copy_only=False)
375 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit _ = arr.to_pandas(deduplicate_objects=False)
380 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Now I do wonder if we should consider turning it off by default for binary data. (For strings, starting with pandas 3.0, which keeps Arrow memory, the option will no longer be relevant anyway.)
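For context, a minimal sketch of the kind of comparison above. The exact construction of `arr` in the original benchmark isn't shown here, so the all-unique binary array below is an assumption chosen to match the scenario described (values are never repeated, so deduplication can only add overhead):

```python
import pyarrow as pa

# Hypothetical setup: a binary array where every value is unique,
# similar in spirit to the benchmark in this issue.
arr = pa.array([b"%016d" % i for i in range(1_000_000)], type=pa.binary())

# to_pandas() defaults to deduplicate_objects=True, which maintains a hash
# table to reuse identical Python objects; with all-unique values this is
# pure overhead.
s_default = arr.to_pandas()

# Disabling deduplication skips the hash-table lookups ...
s_nodedup = arr.to_pandas(deduplicate_objects=False)

# ... which matches what to_numpy() does (it never deduplicates).
np_arr = arr.to_numpy(zero_copy_only=False)
```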
