jorisvandenbossche commented on issue #35126: URL: https://github.com/apache/arrow/issues/35126#issuecomment-1508115139
> Is this slice slowness expected when a table has many chunks? I certainly wouldn't expect such a huge slowdown, but given that slicing a table with many chunks has inherently some more overhead (slicing each individual chunk each time), _some_ slowdown can be expected. But we should maybe see if there some overhead that can be reduced (from a quick profile, I don't directly see something obvious though, a large part of the time is spent in the actual `Array::Slice` / `ArrayData::Slice`). Maybe there could be some optimization when you are taking a slice that covers several chunks entirely, we don't actually call Slice on the chunks that are needed in full for the result. In general, chunks incur some overhead, and a batch size of 1024 is quite small for pyarrow (pyarrow/Arrow C++ is not optimized to work on such small batch sizes). So probably best to use a larger batch size. > Is there a way to tell pyarrow.concat_tables to return a table with a single chunk so I can avoid an extra copy by calling combine_chunks()? That's currently not possible, but note that `concat_tables` does not actually make a copy of the data (the original chunking is preserved), only `combine_chunks` does a copy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
