[DISCUSS][Python] combine_chunks and copies

Spencer Nelson Wed, 18 Oct 2023 15:52:42 -0700

pyarrow.ChunkedArray.combine_chunks is a method documented as
"Flatten this ChunkedArray into a single non-chunked array."

Incidentally, it happens to *always* copy the underlying chunk data - even
if the ChunkedArray is composed of just a single contiguous chunk which
could be returned directly. That has a major performance impact on my
particular application, which calls `combine_chunks` on all ChunkedArrays
to compact them. When there is one chunk, this copy is unnecessary, but my
application spends about 5% to 15% of its total runtime just on these
copies!

A workaround is trivial to implement, but this seems like an unnecessary
footgun. However, the point has been raised that the incidental copy
combine_chunks makes may actually be part of its API, since users might
depend on that copy. This was brought up in a PR [0] and an issue [1].

My discussion topic: is this side effect a part of the combine_chunks API?
If it is, I think it should be documented as such, opening the space for a
new method which avoids the unnecessary copy. If not, I think we should
improve its performance.

---

[0]: "Optimize combine_chunks when there is only one chunk"
https://github.com/apache/arrow/pull/37319
[1]: "Concatenating a single array is a compaction utility"
https://github.com/apache/arrow/issues/37878
