Huy1Ng commented on issue #1294: URL: https://github.com/apache/datafusion-ballista/issues/1294#issuecomment-3728329196
I gave it another try: https://github.com/Huy1Ng/datafusion-ballista/tree/try-enable-view-types. The implementation makes view types work, but performance is roughly 2x worse than with plain strings. The problem is that `BinaryViewArray` is not transport-friendly. `arrow-rs` always tries to split a `RecordBatch` so that it fits within the gRPC message size: https://github.com/apache/arrow-rs/blob/964daecce22c08b60288bb4d00028ed950dabd56/arrow-flight/src/encode.rs#L614 . For `BinaryViewArray` and `StringViewArray` (and possibly `DictionaryArray`), slicing only shrinks the array of views (pointers); the underlying data buffers stay the same and must still be included in the `RecordBatch`, otherwise the pointers would be invalid. As a result, every slice carries the full buffers over the wire. `gc` would help here, but it's extremely expensive. I can squeeze out more performance by being smarter about when to run `gc`, but I wonder whether there is a less complex solution.
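
To make the slicing issue concrete, here is a minimal toy model in plain Rust. It is only a sketch and does not use the real `arrow-rs` types: `ToyViewArray`, `payload_bytes`, and the simplified `(offset, len)` view layout are hypothetical names and assumptions for illustration (real view arrays use 16-byte views, inline short values, and can reference several buffers). It shows why a slice still drags the whole data buffer along, and what a `gc`-style compaction has to copy to avoid that.

```rust
use std::sync::Arc;

/// Toy stand-in for a view array (hypothetical, NOT the arrow-rs layout):
/// each view is an (offset, len) pair pointing into one shared data buffer.
#[derive(Clone)]
struct ToyViewArray {
    views: Vec<(usize, usize)>,
    data: Arc<Vec<u8>>,
}

impl ToyViewArray {
    /// Build an array by appending every value to a single data buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((data.len(), v.len()));
            data.extend_from_slice(v.as_bytes());
        }
        Self { views, data: Arc::new(data) }
    }

    /// Zero-copy slice: only the views shrink, the data buffer is shared as-is.
    fn slice(&self, offset: usize, len: usize) -> Self {
        Self {
            views: self.views[offset..offset + len].to_vec(),
            data: Arc::clone(&self.data),
        }
    }

    /// Compaction in the spirit of `gc`: copy only the bytes still referenced.
    /// This is the expensive part, since every live value is copied.
    fn gc(&self) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(off, len) in &self.views {
            views.push((data.len(), len));
            data.extend_from_slice(&self.data[off..off + len]);
        }
        Self { views, data: Arc::new(data) }
    }

    /// Read the i-th value through its view.
    fn value(&self, i: usize) -> &str {
        let (off, len) = self.views[i];
        std::str::from_utf8(&self.data[off..off + len]).expect("valid UTF-8")
    }

    /// Size of the data buffer that would have to be shipped with this array.
    fn payload_bytes(&self) -> usize {
        self.data.len()
    }
}
```

In this model, slicing one 20-byte value out of a four-value array still reports an 80-byte payload, while calling `gc()` on the slice brings it down to 20 bytes at the cost of a copy. That is the same trade-off as above: skipping `gc` inflates every message, and running it on every slice pays the copy each time.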
