kazuyukitanimura commented on issue #7858: URL: https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1769584341
Thank you @milenkovicm for stress-testing and the improvement ideas. Here are my quick thoughts.

For 1 & 2, I think we can break `emit` down into smaller chunks even before `sort_batch()`: we can flush sorted chunks to disk before starting to sort the next group of unsorted chunks. That way we can reduce (and control) the memory footprint of sorting; we would still need a sorting budget, just a smaller one. This is somewhat a mix of @alamb's two strategies. Per the comment at https://github.com/apache/arrow-datafusion/blob/7acd8833cc5d03ba7643d4ae424553c7681ccce8/datafusion/physical-plan/src/aggregates/row_hash.rs#L675 the plan was to break a large chunk down and write the pieces in parallel (maybe with a pool) so that single-threaded disk writing doesn't become a bottleneck. A sketch of the chunked sort-and-spill idea is below.

For 3, if `ResourcesExhausted` happens but `group_values.len()` is shorter than `self.batch_size`, that means even a single batch's worth of data did not fit in memory. In that case we need to stop at that point: the assumption is that the memory configuration can hold at least one batch in memory. Alternatively, `batch_size` can be reduced. I think we still need these checks even if we spill smaller amounts of data at a time (see the second sketch below).

Cheers!
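
To make the 1 & 2 idea concrete, here is a minimal, self-contained sketch in plain Rust (deliberately not using DataFusion's actual `RecordBatch`/spill types; `sort_and_spill_in_chunks` and the text spill format are illustrative assumptions). It sorts one `batch_size` chunk at a time and flushes it to disk before touching the next chunk, so the peak sort memory is bounded by one chunk rather than the whole accumulation:

```rust
use std::fs::File;
use std::io::{BufWriter, Result, Write};

/// Illustrative stand-in for one accumulated group row.
type Row = u64;

/// Sort and spill `rows` in chunks of `batch_size`, so the in-memory sort
/// budget is bounded by one chunk instead of the whole accumulation.
/// Returns the number of sorted runs written to `path`.
fn sort_and_spill_in_chunks(mut rows: Vec<Row>, batch_size: usize, path: &str) -> Result<usize> {
    let mut writer = BufWriter::new(File::create(path)?);
    let mut spilled_chunks = 0;

    while !rows.is_empty() {
        // Take at most one batch's worth of rows off the end...
        let split = rows.len().saturating_sub(batch_size);
        let mut chunk: Vec<Row> = rows.split_off(split);

        // ...sort only that chunk (small, bounded memory)...
        chunk.sort_unstable();

        // ...and flush it to disk before sorting the next chunk.
        for row in &chunk {
            writeln!(writer, "{row}")?;
        }
        writer.flush()?;
        spilled_chunks += 1;
        // `chunk` is dropped here, freeing its memory before the next iteration.
    }
    Ok(spilled_chunks)
}

fn main() -> Result<()> {
    let rows: Vec<Row> = (0..10_000).rev().collect();
    let n = sort_and_spill_in_chunks(rows, 1024, "/tmp/spill_demo.txt")?;
    println!("spilled {n} sorted chunks");
    Ok(())
}
```

Each spilled run is sorted internally but the runs still have to be merge-read afterwards, which is what the existing spill-merge path already does; the parallel-write (pooled) variant mentioned above would hand each chunk to a writer task instead of writing inline.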
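And here is a hypothetical sketch of the check for 3 (the names `on_resources_exhausted`, `ExecError`, and `SpillDecision` are made up for illustration, not the actual `row_hash.rs` code): if the memory pool rejects a reservation while fewer than `batch_size` groups are held, spilling cannot help because even one batch does not fit, so we fail fast instead of looping on spills.

```rust
/// Illustrative stand-in for DataFusion's `DataFusionError::ResourcesExhausted`.
#[derive(Debug)]
enum ExecError {
    ResourcesExhausted(String),
}

#[derive(Debug)]
enum SpillDecision {
    Spill,
}

/// Decide what to do when a memory reservation fails mid-aggregation.
fn on_resources_exhausted(
    group_values_len: usize,
    batch_size: usize,
) -> Result<SpillDecision, ExecError> {
    if group_values_len < batch_size {
        // Less than one batch accumulated and we are already out of memory:
        // the configured budget cannot hold even a single batch, so stop here
        // (the user can raise the memory limit or reduce `batch_size`).
        Err(ExecError::ResourcesExhausted(format!(
            "cannot hold even one batch ({group_values_len} < batch_size {batch_size}); \
             increase the memory limit or reduce `batch_size`"
        )))
    } else {
        // At least one full batch is in memory, so spilling frees real space.
        Ok(SpillDecision::Spill)
    }
}

fn main() {
    assert!(matches!(
        on_resources_exhausted(4096, 1024),
        Ok(SpillDecision::Spill)
    ));
    assert!(on_resources_exhausted(100, 1024).is_err());
    println!("checks behave as expected");
}
```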
