jorisvandenbossche commented on issue #42026: URL: https://github.com/apache/arrow/issues/42026#issuecomment-2154445411
Profiling both showed a clear difference: there is heavy hash-table usage in the `to_pandas()` case, which reminded me that we have a `deduplicate_objects` option. It is False by default (which is what `to_numpy` uses) but set to True by default in `to_pandas()`. That explains the difference, and in a case like this with all-unique binary values, it only adds unnecessary overhead. Disabling it gives the expected similar performance for both `to_numpy` and `to_pandas`:

```
In [4]: %timeit _ = arr.to_numpy(zero_copy_only=False)
375 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit _ = arr.to_pandas(deduplicate_objects=False)
380 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Now I do wonder if we should consider turning it off by default for binary data. (For strings, starting with pandas 3.0, which keeps Arrow memory, the option will no longer be relevant anyway.)
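For context, a minimal sketch of the kind of comparison above. The exact construction of `arr` in the original benchmark isn't shown here, so the all-unique binary array below is an assumption chosen to match the scenario described (values are never repeated, so deduplication can only add overhead):

```python
import pyarrow as pa

# Hypothetical setup: a binary array where every value is unique,
# similar in spirit to the benchmark in this issue.
arr = pa.array([b"%016d" % i for i in range(1_000_000)], type=pa.binary())

# to_pandas() defaults to deduplicate_objects=True, which maintains a hash
# table to reuse identical Python objects; with all-unique values this is
# pure overhead.
s_default = arr.to_pandas()

# Disabling deduplication skips the hash-table lookups ...
s_nodedup = arr.to_pandas(deduplicate_objects=False)

# ... which matches what to_numpy() does (it never deduplicates).
np_arr = arr.to_numpy(zero_copy_only=False)
```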
