tustvold commented on issue #6692: URL: https://github.com/apache/arrow-rs/issues/6692#issuecomment-2786116734
Perhaps it would be worth thinking about which use-cases we're trying to improve the performance of with this effort, as this will ensure we design something that adequately addresses them.

If we're just talking about PrimitiveArray and StringViewArray types, then I suspect any performance delta is likely to be relatively minor, as concatenating such arrays is already extremely cheap.

If, however, we're looking to improve the performance of DictionaryArray, this becomes a whole different can of worms, as any append-based interface is likely to struggle to efficiently handle arrays with heterogeneous dictionary values. I'm not sure if there is a good solution here tbh.

The only array types where I could see such an append interface potentially having compelling performance benefits are (Large)StringArray, as it would allow eliding potentially large string copies. That being said, this would be reliant on knowing the expected amount of string data up-front, which an append interface won't necessarily know, and such use-cases should probably just use StringViewArray...

The initial issue also stated

> Memory Overhead / Performance Overhead for GarbageCollecting StringView: Buffering up several RecordBatches with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically which actually slows some things down (see https://github.com/apache/datafusion/issues/11628)

But I am honestly not entirely sure how an append interface really changes this: you need to perform some sort of GC at some point, and it is unclear to me why doing it as part of a Coalesce operation or as part of the filter itself would behave materially differently...
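
To make the DictionaryArray concern above a bit more concrete, here is a rough sketch of what an append-style interface has to do when each incoming batch carries its own dictionary. It uses only the Rust standard library and is not the arrow-rs API: `ToyDict`, `ToyDictAppender` and `example` are hypothetical names made up for illustration, and the real array layout is simplified to a pair of `Vec`s.

```rust
use std::collections::HashMap;

/// Toy stand-in for a dictionary-encoded string array (NOT the arrow-rs type).
/// `keys[i]` is an index into `values`.
struct ToyDict {
    keys: Vec<u32>,
    values: Vec<String>,
}

/// Hypothetical append-style accumulator that merges batches into one dictionary.
#[derive(Default)]
struct ToyDictAppender {
    keys: Vec<u32>,
    values: Vec<String>,
    /// Reverse lookup from value to its key in the merged dictionary.
    index: HashMap<String, u32>,
}

impl ToyDictAppender {
    fn append(&mut self, batch: &ToyDict) {
        // The incoming dictionary may differ from the merged one, so the keys
        // cannot be copied verbatim. First build an old-key -> new-key table,
        // which costs one hash lookup per dictionary value in the batch.
        let mut remap = Vec::with_capacity(batch.values.len());
        for v in &batch.values {
            let existing = self.index.get(v).copied();
            let key = match existing {
                Some(k) => k,
                None => {
                    let next = self.values.len() as u32;
                    self.index.insert(v.clone(), next);
                    self.values.push(v.clone());
                    next
                }
            };
            remap.push(key);
        }
        // Then rewrite every key of the batch through the table.
        self.keys
            .extend(batch.keys.iter().map(|&k| remap[k as usize]));
    }
}

/// Appends two batches whose dictionaries only partially overlap and returns
/// the merged keys and values.
fn example() -> (Vec<u32>, Vec<String>) {
    let a = ToyDict {
        keys: vec![0, 1, 0],
        values: vec!["x".to_string(), "y".to_string()],
    };
    let b = ToyDict {
        keys: vec![0, 1, 1],
        values: vec!["y".to_string(), "z".to_string()],
    };
    let mut appender = ToyDictAppender::default();
    appender.append(&a);
    appender.append(&b);
    (appender.keys, appender.values)
}
```

Even in this simplified form, every appended batch pays for hashing its dictionary values and rewriting all of its keys, which is the kind of per-batch work that makes heterogeneous dictionaries awkward for an append-based interface.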
