scott-routledge2 commented on PR #43661: URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549132175
> Could you clarify how the `O(N^2)` complexity arises? I read the issue description but couldn’t quite follow it - probably because I’m not very familiar with the dataset code - so a short walkthrough would be really helpful.

@zanmato1984 of course! It's not so much an issue with datasets as it is the actual casting kernel in compute, as @pitrou pointed out.

When casting small slices of size L of a large array of length N, casting the first slice requires allocating a new offsets buffer of size L+1 for the up/downcast int offsets. The second slice starts at offset L, and [because of the way casting is implemented](https://github.com/bodo-ai/arrow-bodo-fork/blob/5eaf553bfc7aa639fd67bd622b6b808e71fbba39/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L294), the output of the cast inherits the offset from the input slice, so we actually need to allocate a buffer of size 2L+1 and zero out its first L elements. Continuing this pattern, the k-th slice costs roughly kL, so the total work is roughly L + 2L + 3L + ... + N ~ N^2/(2L), which is `O(N^2)` if L is constant.

Where this relates to datasets: when reading a Parquet file, ScanAsyncBatches essentially reads one row group at a time and then produces slices on top of that row group, which then need to be cast in MakeExecBatch, following the pattern described above (see the sketch below).
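To make the pattern concrete, here is a minimal sketch in Python/pyarrow (not from the PR; `N` and `L` are arbitrary hypothetical values) that casts fixed-size slices of one string array, which should go through the same scalar_cast_string kernel linked above:

```python
import pyarrow as pa

# Hypothetical sizes chosen only to illustrate the pattern.
N = 200_000   # total array length
L = 1_000     # slice / batch size

arr = pa.array(["x"] * N, type=pa.string())

for start in range(0, N, L):
    batch = arr.slice(start, L)
    # Each cast builds a new int64 offsets buffer. Because the output
    # inherits the slice's offset, the buffer for the slice starting at
    # `start` covers start + L + 1 offsets (with the first `start`
    # entries zeroed), so per-slice work grows with `start` and the
    # total over all slices is ~ N^2 / (2L).
    batch.cast(pa.large_string())
```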
