jorisvandenbossche commented on issue #35126: URL: https://github.com/apache/arrow/issues/35126#issuecomment-1508115139
> Is this slice slowness expected when a table has many chunks? I certainly wouldn't expect such a huge slowdown, but given that slicing a table with many chunks has inherently some more overhead (slicing each individual chunk each time), _some_ slowdown can be expected. But we should maybe see if there some overhead that can be reduced (from a quick profile, I don't directly see something obvious though, a large part of the time is spent in the actual `Array::Slice` / `ArrayData::Slice`). Maybe there could be some optimization when you are taking a slice that covers several chunks entirely, we don't actually call Slice on the chunks that are needed in full for the result. In general, chunks incur some overhead, and a batch size of 1024 is quite small for pyarrow (pyarrow/Arrow C++ is not optimized to work on such small batch sizes). So probably best to use a larger batch size. > Is there a way to tell pyarrow.concat_tables to return a table with a single chunk so I can avoid an extra copy by calling combine_chunks()? That's currently not possible, but note that `concat_tables` does not actually make a copy of the data (the original chunking is preserved), only `combine_chunks` does a copy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
