milenkovicm commented on issue #7858:
URL:
https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1777239272
I've done some more testing with spilling and, as I mentioned in my previous
message, I don't see "small" spills.
`RepartitionExec` is another operator that complains about memory
consumption. To drill down into the problem, I changed `RepartitionExec` to
print a warning instead of panicking when it exceeds the given memory threshold.
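For context, the change was roughly of this shape (a minimal sketch, not the actual diff; it assumes the operator accounts for its buffered batches through a `MemoryReservation`-style handle with a `try_grow` method):

```rust
use datafusion_execution::memory_pool::MemoryReservation;

// Sketch of the experiment: instead of propagating the error (which fails the
// query) when the reservation cannot grow, log a warning and keep going so the
// memory pressure points can be observed.
fn track_buffered_batch(reservation: &mut MemoryReservation, batch_bytes: usize) {
    if let Err(e) = reservation.try_grow(batch_bytes) {
        // Original behaviour: fail here. For this test we only warn.
        log::warn!("RepartitionExec exceeded its memory budget: {e}");
    }
}
```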
If we take a simple physical plan:
```
Optimized physical plan:
ProjectionExec: expr=[COUNT(*)@1 as COUNT(*), SUM(ta.id)@2 as SUM(ta.id), SUM(ta.co)@3 as SUM(ta.co), SUM(ta.n)@4 as SUM(ta.n), uid@0 as uid]
  GlobalLimitExec: skip=0, fetch=10
    CoalescePartitionsExec
      AggregateExec: mode=FinalPartitioned, gby=[uid@0 as uid], aggr=[COUNT(*), SUM(ta.id), SUM(ta.co), SUM(ta.n)]
        CoalesceBatchesExec: target_batch_size=8192
          RepartitionExec: partitioning=Hash([uid@0], 4), input_partitions=4
            AggregateExec: mode=Partial, gby=[uid@2 as uid], aggr=[COUNT(*), SUM(ta.id), SUM(ta.co), SUM(ta.n)]
              ParquetExec: file_groups={4 groups: [[ ... ]]}, projection=[id, co, uid, n]
```
I have noticed that the `RepartitionExec` memory warnings correlate with
spilling in `AggregateExec: mode=FinalPartitioned`.
My guess is that the unbounded channel in `RepartitionExec` acts as a buffer
while `AggregateExec` spills.
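To make the buffering hypothesis concrete, here is a standalone sketch (plain `tokio`, not DataFusion's actual channel wiring) of how an unbounded channel absorbs a fast producer's batches while the consumer is stalled, e.g. busy spilling:

```rust
use tokio::sync::mpsc;

// Illustration only: every batch the consumer has not yet pulled sits in the
// unbounded channel, so the channel's memory grows with the speed gap.
#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<u8>>();

    // Fast producer: pushes 1 MiB "batches" with no backpressure.
    let producer = tokio::spawn(async move {
        for _ in 0..1_000 {
            tx.send(vec![0u8; 1 << 20]).unwrap();
        }
    });

    // Slow consumer: simulates a downstream operator that is busy spilling.
    let consumer = tokio::spawn(async move {
        while let Some(batch) = rx.recv().await {
            tokio::time::sleep(std::time::Duration::from_millis(5)).await;
            drop(batch);
        }
    });

    producer.await.unwrap();
    consumer.await.unwrap();
}
```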
This raises another question: what would be the optimal way to handle memory
reservation in `RepartitionExec`?
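One option (purely a sketch with illustrative names, not a proposal for the exact API) would be to account for the bytes sitting in the channel: grow a reservation/counter when a batch is queued and shrink it when the batch is dequeued, so the memory pool always sees what is buffered between the two `AggregateExec` stages.

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc;

// Hypothetical accounting around the repartition channel. In DataFusion the
// counter could be a MemoryReservation grow/shrink pair instead of a usize.
struct TrackedSender {
    tx: mpsc::UnboundedSender<Vec<u8>>,
    buffered_bytes: Arc<Mutex<usize>>,
}

impl TrackedSender {
    fn send(&self, batch: Vec<u8>) {
        // Account for the batch as soon as it enters the channel.
        *self.buffered_bytes.lock().unwrap() += batch.len();
        let _ = self.tx.send(batch);
    }
}

async fn recv(
    rx: &mut mpsc::UnboundedReceiver<Vec<u8>>,
    buffered_bytes: &Arc<Mutex<usize>>,
) -> Option<Vec<u8>> {
    let batch = rx.recv().await?;
    // Release the accounted memory once the batch leaves the channel.
    *buffered_bytes.lock().unwrap() -= batch.len();
    Some(batch)
}
```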