zanmato1984 commented on PR #43661:
URL: https://github.com/apache/arrow/pull/43661#issuecomment-3549830139

   Thanks for the explanation @scott-routledge2! I now see the problem, and 
sorry that you had to explain it again.
   
   IIUC there are two aspects to this particular use case:
   1. The cast implementation of `utf8` to `large_utf8` for a slice with a 
non-zero offset incurs a performance penalty from over-allocation and redundant 
zero-filling.
   2. The cast implicitly relies on the thin `MakeExecBatch` doing reluctant 
schema alignment.
   
   I think these are separate issues. For 1, I agree with @pitrou that it is an 
arguable implementation choice we made for `BinaryToBinaryCastExec`. Regarding 
this quote:
   > since the output of the cast inherits the offset from the input slice
   
   It doesn't necessarily have to be the case. We are free to output an 
intact, zero-offset array that is logically equal to the current output with 
its non-zero offset. This should require no API changes.
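   
   To make the difference concrete, here is a toy model in plain Python. It 
is not Arrow code and does not reproduce `BinaryToBinaryCastExec`; the helper 
names are hypothetical. It assumes that the offset-inheriting cast allocates 
`offset + length + 1` widened offsets and zero-fills the unused prefix, while 
the zero-offset variant widens only the `length + 1` offsets the slice uses.

```python
from array import array

def make_utf8(strings):
    """Build (offsets, data) for a toy utf8 array with narrow offsets."""
    offsets = array("i", [0])
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)

def values(offsets, data, offset, length):
    """Decode the logical values of the slice [offset, offset + length)."""
    return [
        data[offsets[offset + i]:offsets[offset + i + 1]].decode("utf-8")
        for i in range(length)
    ]

def cast_inheriting_offset(offsets, offset, length):
    """Widen offsets to 64 bits while keeping the input slice's offset."""
    # Assumption: the unused prefix [0, offset) is allocated and zero-filled.
    out = array("q", [0] * (offset + length + 1))
    for i in range(offset, offset + length + 1):
        out[i] = offsets[i]
    return out, offset

def cast_zero_offset(offsets, offset, length):
    """Widen only the offsets the slice uses; the output offset is 0."""
    # The widened offsets still index into the same, shared data buffer.
    out = array("q", (offsets[i] for i in range(offset, offset + length + 1)))
    return out, 0

offsets, data = make_utf8(["a", "bb", "ccc", "dddd", "eeeee"])
slice_offset, slice_length = 3, 2

wide_a, off_a = cast_inheriting_offset(offsets, slice_offset, slice_length)
wide_b, off_b = cast_zero_offset(offsets, slice_offset, slice_length)

# Both outputs describe the same logical values.
same = (values(wide_a, data, off_a, slice_length)
        == values(wide_b, data, off_b, slice_length))
```

   In this model both results decode to the same two strings, but the 
zero-offset output holds 3 offsets instead of 6. The gap grows with the 
slice's offset, which is the over-allocation described in point 1.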
   
   For 2, I think the requirement of outputting batches with a unified schema 
(applying implicit casts when necessary) makes a lot of sense. However, I'm 
wondering if we can do it in a less intrusive way, for example by applying 
the casts explicitly through existing mechanisms like 
`ScanOptions::projection` 
(https://github.com/apache/arrow/blob/5a480444da35fa26bc6952755510ad39df9f7002/cpp/src/arrow/dataset/scanner.h#L62).
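   
   As a rough illustration of what "explicit casts via a projection" could 
mean, here is a small stdlib-only Python sketch. It does not use the Arrow 
API: `field`, `cast`, `project` and the batch representation are hypothetical 
stand-ins for projection expressions, and the set of allowed casts is made up 
for the example.

```python
# Toy batch: column name -> (type name, values). Not an Arrow data structure.
WIDENING_CASTS = {("utf8", "large_utf8"), ("int32", "int64")}

def field(name):
    """Expression that passes a column through unchanged."""
    return lambda batch: batch[name]

def cast(name, target_type):
    """Expression that explicitly casts a column to target_type."""
    def expr(batch):
        source_type, vals = batch[name]
        if source_type == target_type:
            return source_type, vals  # already aligned, nothing to do
        if (source_type, target_type) not in WIDENING_CASTS:
            raise TypeError(f"no cast from {source_type} to {target_type}")
        return target_type, list(vals)
    return expr

def project(batch, projection):
    """Evaluate every projection expression against one batch."""
    return {name: expr(batch) for name, expr in projection.items()}

# The unified schema is stated once, in the projection itself.
projection = {"id": field("id"), "s": cast("s", "large_utf8")}

batch_a = {"id": ("int64", [1, 2]), "s": ("utf8", ["x", "y"])}
batch_b = {"id": ("int64", [3]), "s": ("large_utf8", ["z"])}

unified = [project(b, projection) for b in (batch_a, batch_b)]
schemas = [{name: typ for name, (typ, _) in b.items()} for b in unified]
```

   The point of the sketch is only that the cast is visible in the projection 
rather than happening implicitly while batches are assembled.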
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
