ChristianBeilschmidt commented on issue #4724: URL: https://github.com/apache/arrow-rs/issues/4724#issuecomment-1689518778
> We generally try very hard to avoid copying data, as it is a major bottleneck and typically not desirable. The downside as you have discovered is potentially higher memory usage. Another area this turns up in is array slicing, which is zero-copy in a similar manner.

There is no one-size-fits-all, I guess. Zero-copy is one intended behavior, and memory efficiency (or memory guarantees) is another. Ideally, developers can decide between them.

> I'm not averse to adding a kernel or function on Array to "compact" the underlying buffers of an array, but I wonder if you've experimented with writing the data in smaller batches?

Such a kernel could be quite efficient, right? We would only need to compact the arrays that aren't already compact and leave the rest as they are. Would this be better than adding a config param to the decompress method? Zero-copy is more of a special-case optimization than a general one, since you cannot point into the buffer data if the data is compressed. In the default case, you couldn't do it anyway.

> Ultimately if the size of an encoded RecordBatch is large enough to cause concern, you're going to struggle to process it without blowing your memory budget regardless... A smaller batch size might let you get the best of both worlds, zero-copy without blowing your memory budget?

Our use case is more concerned with having lots of records/arrays around in the system and keeping track of overall limits, e.g., having *N* data chunks (of different data) that do not get larger after compressing and decompressing them. So it is quite inconvenient that data becomes larger after a compress-decompress step. I guess twice the size is the upper limit with the current approach.

But thank you for the suggestion :+1:
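To make the trade-off concrete, here is a minimal sketch of the "compact only what isn't compact" idea. It uses only the standard library and does not use the arrow-rs API: `SliceView`, `retained_bytes`, `is_compact`, and `compact` are hypothetical names chosen for illustration, and a real kernel would also have to handle validity bitmaps, offsets, and nested arrays.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array: a window into a shared, immutable buffer.
#[derive(Clone)]
struct SliceView {
    buffer: Arc<[u8]>,
    offset: usize,
    len: usize,
}

impl SliceView {
    /// Takes ownership of `data`; the view covers the whole buffer.
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        SliceView { buffer: Arc::from(data), offset: 0, len }
    }

    /// Zero-copy slice: the result shares the parent's allocation.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        SliceView {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// The bytes this view logically contains.
    fn values(&self) -> &[u8] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Bytes kept alive by this view: the whole backing allocation.
    fn retained_bytes(&self) -> usize {
        self.buffer.len()
    }

    /// True if the view covers its entire backing buffer.
    fn is_compact(&self) -> bool {
        self.offset == 0 && self.len == self.buffer.len()
    }

    /// Copies only when needed; an already compact view is returned as-is
    /// (a cheap reference-count bump, no data copy).
    fn compact(&self) -> Self {
        if self.is_compact() {
            self.clone()
        } else {
            SliceView::new(self.values().to_vec())
        }
    }
}

fn main() {
    let full = SliceView::new(vec![7u8; 1024]);
    let small = full.slice(100, 16);
    // The 16-byte slice still pins the full 1024-byte allocation.
    println!("before: {} bytes retained", small.retained_bytes());

    let compacted = small.compact();
    println!("after: {} bytes retained", compacted.retained_bytes());
}
```

The point of the `is_compact` check is that arrays which already own exactly their data pay nothing, so the copy cost is limited to the views that actually retain more memory than they use.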
