Re: [I] Extra repartitions in physical plan and strange optimizer behavior overall [arrow-datafusion]

via GitHub Fri, 26 Jan 2024 12:57:28 -0800


alamb commented on issue #9011:
URL: 
https://github.com/apache/arrow-datafusion/issues/9011#issuecomment-1912686214


   The original plan you show has a `TableScan` at the top -- is this a 
projection? Or is it a view definition somehow?
   
   My reading of the plan:
   ```
   TableScan: ?table? projection=[project_id, user_id, created_at, event_id, event, str_0] <---- this says it needs all columns
     PartitionedAggregate: ...
       Filter: project_id = Int64(1) AND created_at >= TimestampNanosecond(1705419428144118000, None) AND created_at <= TimestampNanosecond(1706283428144118000, None) AND event = UInt16(13)
         Sort: project_id ASC NULLS LAST, user_id ASC NULLS LAST
           Repartition: Hash(project_id, user_id) partition_count=12
             Projection: project_id, user_id, created_at, event
               TableScan: ?table? projection=[project_id, user_id, created_at, event_id, event, str_0]]
   ```
   
   As for the repartitioning, I think that is happening to try to run the 
filter in parallel. It is somewhat messy, but not obviously wrong to me.
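
To illustrate what I mean by running the filter in parallel, here is a toy 
Python sketch (not DataFusion code; `hash_repartition` and `filter_partition` 
are hypothetical helpers). It assumes a hash repartition routes every row by 
the hash of its key columns, so each output partition can then be filtered 
independently of the others:

```python
from collections import defaultdict


def hash_repartition(rows, key_columns, partition_count):
    """Toy stand-in for `Repartition: Hash(...)`: route each row by its key hash."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        partitions[hash(key) % partition_count].append(row)
    # Always return exactly `partition_count` partitions; some may be empty.
    return [partitions[i] for i in range(partition_count)]


def filter_partition(partition, predicate):
    """Apply the predicate to one partition; needs no data from other partitions."""
    return [row for row in partition if predicate(row)]


# Small synthetic input: 2 projects x 6 users x 2 event types = 24 rows.
rows = [
    {"project_id": p, "user_id": u, "event": e}
    for p in (1, 2)
    for u in range(6)
    for e in (13, 14)
]

partitions = hash_repartition(rows, ["project_id", "user_id"], 12)


def keep(row):
    return row["project_id"] == 1 and row["event"] == 13


# Each partition could be handed to a separate worker at this point.
filtered = [filter_partition(part, keep) for part in partitions]
```

All rows that share a `(project_id, user_id)` pair land in the same partition, 
which is also what a downstream per-key aggregate would need.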
   
   The optimizer is going to try to satisfy the requirements stated by the 
topmost `ExecutionPlan` in your tree. Thus I think figuring out what the 
topmost plan node is, and what it is requesting of its input, is the best way 
to understand what DataFusion is doing in this case.
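
As a rough mental model of "satisfying the requirements", here is a toy Python 
sketch. It is not DataFusion's implementation: `Node`, `enforce`, and `render` 
are hypothetical names, and partitioning is reduced to a simple tuple. The 
assumption is that a parent node declares what it needs from its input (in 
DataFusion that would come from something like 
`ExecutionPlan::required_input_distribution`), and the optimizer inserts a 
repartition whenever the child does not already provide it:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    # What this node produces, e.g. ("hash", ("project_id", "user_id")).
    output_partitioning: Optional[tuple] = None
    # What this node demands of each input; None means "no requirement".
    required_input_partitioning: Optional[tuple] = None
    children: list = field(default_factory=list)


def enforce(node):
    """Insert a Repartition under `node` wherever a child misses its requirement."""
    required = node.required_input_partitioning
    new_children = []
    for child in node.children:
        child = enforce(child)  # fix up the subtree first
        if required is not None and child.output_partitioning != required:
            child = Node("Repartition", output_partitioning=required, children=[child])
        new_children.append(child)
    node.children = new_children
    return node


def render(node, depth=0):
    """Return the plan as indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


HASH_KEY = ("hash", ("project_id", "user_id"))

# The scan does not advertise any partitioning, so a Repartition gets added.
unpartitioned = enforce(
    Node("Aggregate", required_input_partitioning=HASH_KEY,
         children=[Node("TableScan")])
)

# The scan already advertises the required partitioning, so nothing is added.
prepartitioned = enforce(
    Node("Aggregate", required_input_partitioning=HASH_KEY,
         children=[Node("TableScan", output_partitioning=HASH_KEY)])
)

print("\n".join(render(unpartitioned)))
print("\n".join(render(prepartitioned)))
```

In this model the extra node only shows up when the parent asks for something 
its input does not advertise, which is why looking at what the topmost node 
requests is a good place to start.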


