andishgar commented on PR #46230: URL: https://github.com/apache/arrow/pull/46230#issuecomment-2838025640
> If we're not preallocating buffers which could be preallocated, I think it'd be better to correct the other kernels to do so rather than remove an optimization from what is probably the most-used kernel My changes impact 8 kernels. Based on your feedback, I understand I should rewrite the entire binary casting to use binary types. Let me explain each change in detail. 1- Offset String -> Offset String The allocated offset buffer is dropped in zero-copy casting https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L308 If the input and output offset types differ, a new buffer is allocated https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L251-L253 https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L279-L281 2- String View -> Offset String The implementation overlooks the allocation of a second buffer https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L341 Additionally, there’s a risk of generating utf8() and binary() types with lengths exceeding 2 billion, as discussed [here](https://github.com/apache/arrow/issues/46128#issue-299261058). 3- Offset String -> String View This requires updates to the compute infrastructure first ([issue](https://github.com/apache/arrow/issues/46177#issue-3003593798)). Additional challenges with slicing and casting are noted [here](https://github.com/apache/arrow/issues/46128#issue-2992610582). 4- String View -> String View The code works correctly with the current logic in the compute function., however if we resolve [this](https://github.com/apache/arrow/issues/46177#issuecomment-2814187762), ` MemAllocation::NO_PREALLOCATE ` should be used to prevent extra casting 5- Fixed -> String View It has the same issue of Offset string-> String View 6- Fixed -> Offset string Is it possible to have a separate path for scalar and array? https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L597-L612 7-Fixed -> Fixed The second buffer is allocated, However, there is no need 8- String| String View -> Fixed First : The second buffer is allocated but ignored by the implementation. Second: Why the data buffer is copied for casting from offset string to FixedSizeString https://github.com/apache/arrow/blob/e6d0edcf24499fdcd7b16a793f1cab4f61754c92/cpp/src/arrow/compute/kernels/scalar_cast_string.cc#L645-L661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
