zanmato1984 commented on PR #43661: URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549830139
Thanks for the explanation @scott-routledge2! I now see the problem, and sorry that you had to illustrate it again. IIUC there are two separate aspects in this particular use case:

1. The `utf8` to `large_utf8` cast implementation, when applied to a slice with a non-zero offset, pays a performance penalty from over-allocation and redundant zero-filling.
2. The cast implicitly relies on the thin `MakeExecBatch` doing a reluctant schema alignment.

For 1, I agree with @pitrou that this is an arguable implementation choice we made for `BinaryToBinaryCastExec`. In other words, to quote:

> since the output of the cast inherits the offset from the input slice

It doesn't necessarily have to be that way. We are free to output an intact, zero-offset array, which is logically equal to the current output with a non-zero offset. This should require no API changes. (See the first sketch below for what the zero-offset alternative looks like.)

For 2, the requirement of outputting batches with a unified schema (applying implicit casts when necessary) makes a lot of sense. However, I'm wondering if we could do it in a less intrusive way, for example by applying the casts explicitly through existing mechanisms like `ScanOptions::projection` (https://github.com/apache/arrow/blob/5a480444da35fa26bc6952755510ad39df9f7002/cpp/src/arrow/dataset/scanner.h#L62); see the second sketch below.

Thanks.
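To make the "zero-offset, logically equal output" point concrete, here is a minimal standalone sketch using only public Arrow C++ APIs (the function name `Demo` and the sample values are mine, not anything in this PR). It casts a sliced `utf8` array directly, then does the same after compacting the slice into fresh buffers via single-array `arrow::Concatenate`, which yields an offset-0 array with the same logical contents:

```cpp
#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

#include <iostream>
#include <string>
#include <vector>

// Hypothetical demo, not part of this PR: compare casting a sliced utf8
// array as-is vs. after compacting it into zero-offset buffers.
arrow::Status Demo() {
  arrow::StringBuilder builder;
  std::vector<std::string> vals = {"a", "b", "c", "d", "e"};
  ARROW_RETURN_NOT_OK(builder.AppendValues(vals));
  std::shared_ptr<arrow::Array> values;
  ARROW_RETURN_NOT_OK(builder.Finish(&values));

  // Slice with a non-zero offset, as in the use case discussed above.
  std::shared_ptr<arrow::Array> slice = values->Slice(2, 3);

  // Per the discussion above, the cast output currently inherits offset == 2.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(slice, arrow::large_utf8()));
  std::cout << "offset after cast: " << casted.make_array()->offset() << std::endl;

  // Compacting the slice first (single-array Concatenate copies into fresh,
  // tightly sized buffers) gives a logically equal, zero-offset result.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> compacted,
                        arrow::Concatenate({slice}));
  ARROW_ASSIGN_OR_RAISE(arrow::Datum recast,
                        arrow::compute::Cast(compacted, arrow::large_utf8()));
  std::cout << "offset after compact + cast: " << recast.make_array()->offset()
            << std::endl;

  return arrow::Status::OK();
}
```

The sketch does the compaction at the call site only to illustrate the semantics; the "no API changes" remark refers to doing the equivalent inside `BinaryToBinaryCastExec` itself.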

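And for the `ScanOptions::projection` idea, a rough sketch of the mechanism I mean (dataset construction is omitted and the column name `"s"` is made up; this is only meant to show where an explicit cast expression would hang, not a claim that it already covers the schema-evolution case in this PR):

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>

namespace cp = arrow::compute;
namespace ds = arrow::dataset;

// Sketch: given an already-constructed dataset whose fragments expose a
// utf8 column "s" that we want to read as large_utf8, attach the cast as
// an explicit projection expression. The projection lands in
// ScanOptions::projection and is applied per batch by the scanner instead
// of being left to implicit schema alignment downstream.
arrow::Result<std::shared_ptr<arrow::Table>> ScanWithExplicitCast(
    const std::shared_ptr<ds::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project(
      {cp::call("cast", {cp::field_ref("s")},
                cp::CastOptions::Safe(arrow::large_utf8()))},
      {"s"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();
}
```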