milenkovicm commented on issue #7858:
URL: 
https://github.com/apache/arrow-datafusion/issues/7858#issuecomment-1777239272

   I've done some more testing with spilling and, as I mentioned in my previous 
message, I don't see "small" spills.
   
   `RepartitionExec` is another operator that complains about memory 
consumption. To drill down into the problem, I changed `RepartitionExec` to 
print a warning instead of panicking when it goes over the given memory threshold.
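
To illustrate the kind of change described above, here is a toy model of "warn instead of fail" accounting. It is only a sketch: `Budget`, `try_grow` and `grow_or_warn` are hypothetical names written for this example and are not DataFusion's memory pool API or the actual patch.

```rust
/// Toy memory budget (hypothetical; not DataFusion's memory pool).
struct Budget {
    limit: usize,    // maximum bytes the operator is allowed to hold
    used: usize,     // bytes currently accounted for
    warnings: usize, // how many times the limit was exceeded
}

impl Budget {
    fn new(limit: usize) -> Self {
        Budget { limit, used: 0, warnings: 0 }
    }

    /// Strict variant: refuse to grow past the limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!(
                "cannot reserve {} bytes: {} of {} already used",
                bytes, self.used, self.limit
            ));
        }
        self.used += bytes;
        Ok(())
    }

    /// Diagnostic variant: log a warning, then grow anyway so the
    /// query keeps running and the overshoot can be observed.
    fn grow_or_warn(&mut self, bytes: usize) {
        if let Err(msg) = self.try_grow(bytes) {
            eprintln!("warning: {}", msg);
            self.warnings += 1;
            self.used += bytes;
        }
    }
}
```

With a limit of 100 bytes, two calls to `grow_or_warn(60)` leave `used` at 120 and `warnings` at 1, whereas the strict `try_grow` would have rejected the second call.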
   
   If we take a simple physical plan:
   
   ```
   Optimized physical plan:
     ProjectionExec: expr=[COUNT(*)@1 as COUNT(*), SUM(ta.id)@2 as SUM(ta.id), SUM(ta.co)@3 as SUM(ta.co), SUM(ta.n)@4 as SUM(ta.n), uid@0 as uid]
       GlobalLimitExec: skip=0, fetch=10
         CoalescePartitionsExec
           AggregateExec: mode=FinalPartitioned, gby=[uid@0 as uid], aggr=[COUNT(*), SUM(ta.id), SUM(ta.co), SUM(ta.n)]
             CoalesceBatchesExec: target_batch_size=8192
               RepartitionExec: partitioning=Hash([uid@0], 4), input_partitions=4
                 AggregateExec: mode=Partial, gby=[uid@2 as uid], aggr=[COUNT(*), SUM(ta.id), SUM(ta.co), SUM(ta.n)]
                   ParquetExec: file_groups={4 groups: [[ ... ]]}, projection=[id, co, uid, n]
   ```
   
   I have noticed that the `RepartitionExec` memory warnings correlate with 
spilling in `AggregateExec: mode=FinalPartitioned`. 
   
   My guess is that the unbounded channel in `RepartitionExec` acts as a buffer 
while `AggregateExec` spills. 
   This raises another question: what would be the optimal way of handling memory 
reservation in `RepartitionExec`?
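
The suspected effect can be sketched with standard-library channels. This is only an illustration of the guess above, not DataFusion code: `Batch`, `unbounded_backlog` and `bounded_backlog` are hypothetical names, and `std::sync::mpsc` stands in for whatever channel `RepartitionExec` actually uses. The consumer is modelled as stalled (as it would be while spilling) simply by not reading until the producer is done.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Stand-in for a record batch: just a byte buffer (hypothetical).
type Batch = Vec<u8>;

/// Bytes held by an unbounded channel while the consumer is not reading.
fn unbounded_backlog(batches: usize, batch_bytes: usize) -> usize {
    let (tx, rx) = channel::<Batch>();
    for _ in 0..batches {
        // An unbounded send never blocks, so every batch is queued.
        tx.send(vec![0u8; batch_bytes]).expect("receiver is alive");
    }
    drop(tx); // close the channel so the drain below terminates
    // Consumer resumes: everything that was sent had been buffered.
    rx.iter().map(|b| b.len()).sum()
}

/// Same producer against a bounded channel holding `capacity` batches.
fn bounded_backlog(batches: usize, batch_bytes: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Batch>(capacity);
    for _ in 0..batches {
        match tx.try_send(vec![0u8; batch_bytes]) {
            Ok(()) => {}
            // A real producer would wait here (backpressure); the sketch stops.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is alive"),
        }
    }
    drop(tx);
    rx.iter().map(|b| b.len()).sum()
}
```

For 8 batches of 1024 bytes, `unbounded_backlog(8, 1024)` holds all 8192 bytes, while `bounded_backlog(8, 1024, 2)` never holds more than 2048. A bounded channel trades the memory growth for backpressure on the upstream partial aggregate, which is one possible direction for the reservation question, though whether it suits `RepartitionExec` is an open point.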
   


