zanmato1984 commented on PR #43661: URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549830139
Thanks for the explanation @scott-routledge2! I now see the problem, and sorry that you had to illustrate it again. IIUC there are two separate aspects in this particular use case:

1. The `utf8` to `large_utf8` cast implementation, when applied to a slice with a non-zero offset, pays a performance penalty from over-allocation and redundant zero-filling.
2. The cast implicitly relies on the thin `MakeExecBatch` doing a reluctant schema alignment.

For 1, I agree with @pitrou that this is an arguable implementation choice we made for `BinaryToBinaryCastExec`. In other words, to quote:

> since the output of the cast inherits the offset from the input slice

It doesn't necessarily have to be that way. We are free to output an intact, zero-offset array, which is logically equal to the current output with a non-zero offset. This should require no API changes. (See the first sketch below for what the zero-offset alternative looks like.)

For 2, the requirement of outputting batches with a unified schema (applying implicit casts when necessary) makes a lot of sense. However, I'm wondering if we could do it in a less intrusive way, for example by applying the casts explicitly through existing mechanisms like `ScanOptions::projection` (https://github.com/apache/arrow/blob/5a480444da35fa26bc6952755510ad39df9f7002/cpp/src/arrow/dataset/scanner.h#L62); see the second sketch below.

Thanks.
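To make the "zero-offset, logically equal output" point concrete, here is a minimal standalone sketch using only public Arrow C++ APIs (the function name `Demo` and the sample values are mine, not anything in this PR). It casts a sliced `utf8` array directly, then does the same after compacting the slice into fresh buffers via single-array `arrow::Concatenate`, which yields an offset-0 array with the same logical contents:

```cpp
#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

#include <iostream>
#include <string>
#include <vector>

// Hypothetical demo, not part of this PR: compare casting a sliced utf8
// array as-is vs. after compacting it into zero-offset buffers.
arrow::Status Demo() {
  arrow::StringBuilder builder;
  std::vector<std::string> vals = {"a", "b", "c", "d", "e"};
  ARROW_RETURN_NOT_OK(builder.AppendValues(vals));
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  // Slice with a non-zero offset, as in the use case discussed above.
  std::shared_ptr<arrow::Array> slice = values->Slice(2, 3);

  // Per the discussion above, the cast output currently inherits offset == 2.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(slice, arrow::large_utf8()));
  std::cout << "offset after cast: " << casted.make_array()->offset() << std::endl;

  // Compacting the slice first (single-array Concatenate copies into fresh,
  // tightly sized buffers) gives a logically equal, zero-offset result.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> compacted,
                        arrow::Concatenate({slice}));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum recast,
                        arrow::compute::Cast(compacted, arrow::large_utf8()));
  std::cout << "offset after compact + cast: " << recast.make_array()->offset()
            << std::endl;

  return arrow::Status::OK();
}
```

The sketch does the compaction at the call site only to illustrate the semantics; the "no API changes" remark refers to doing the equivalent inside `BinaryToBinaryCastExec` itself.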

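And for the `ScanOptions::projection` idea, a rough sketch of the mechanism I mean (dataset construction is omitted and the column name `"s"` is made up; this is only meant to show where an explicit cast expression would hang, not a claim that it already covers the schema-evolution case in this PR):

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

// Sketch: given an already-constructed dataset whose fragments expose a
// utf8 column "s" that we want to read as large_utf8, attach the cast as
// an explicit projection expression. The projection lands in
// ScanOptions::projection and is applied per batch by the scanner instead
// of being left to implicit schema alignment downstream.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithExplicitCast(
    const std::shared_ptr<ds::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project(
      {cp::call("cast", {cp::field_ref("s")},
                cp::CastOptions::Safe(arrow::large_utf8()))},
      {"s"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```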