jorisvandenbossche commented on issue #42026:
URL: https://github.com/apache/arrow/issues/42026#issuecomment-2154445411

   Profiling both showed a clear difference: there is a lot of hash table usage in the 
`to_pandas()` case, which reminded me that we have a `deduplicate_objects` option, 
which is False by default (the path `to_numpy` uses) but set to True by default in 
`to_pandas()`.
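   
   To illustrate what the option does (a minimal sketch with made-up values, not taken from the issue): with deduplication enabled, equal values in the converted result point to a single shared Python object, at the cost of a hash table lookup per element.
   
   ```
   import pyarrow as pa
   
   # Minimal sketch (hypothetical data): a binary array with repeated values.
   arr = pa.array([b"spam", b"eggs"] * 3, type=pa.binary())
   
   deduped = arr.to_pandas()                          # deduplicate_objects=True (default)
   plain = arr.to_pandas(deduplicate_objects=False)
   
   # With deduplication, equal values should share one Python bytes object;
   # without it, each element is a freshly allocated object (multi-byte bytes
   # are not interned on CPython).
   print(deduped.iloc[0] is deduped.iloc[2])  # expected: True
   print(plain.iloc[0] is plain.iloc[2])      # expected: False
   ```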
   
   That explains the difference, and in a case like this where all the binary 
values are unique, it just adds unnecessary overhead. Disabling it gives the expected 
similar performance for both `to_numpy` and `to_pandas`:
   
   ```
   In [4]: %timeit _ = arr.to_numpy(zero_copy_only=False)
   375 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [5]: %timeit _ = arr.to_pandas(deduplicate_objects=False)
   380 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
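   
   For reference, a hypothetical way to reproduce the all-unique case (the actual data used above is not shown in this comment); with every value unique, the deduplication hash table never finds a match and only adds per-element overhead:
   
   ```
   import pyarrow as pa
   
   # Hypothetical data: a few million unique binary values.
   arr = pa.array([b"value-%d" % i for i in range(5_000_000)], type=pa.binary())
   
   # In IPython, compare the three conversion paths:
   #   %timeit _ = arr.to_numpy(zero_copy_only=False)
   #   %timeit _ = arr.to_pandas(deduplicate_objects=False)
   #   %timeit _ = arr.to_pandas()   # deduplicate_objects=True (default)
   ```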
   
   Now I do wonder if we should consider turning it off by default for binary data. 
(For strings the option will no longer be relevant anyway, since starting with 
pandas 3.0 the Arrow memory is kept as-is.)

