gene-bordegaray opened a new issue, #21148:
URL: https://github.com/apache/datafusion/issues/21148

   ### Is your feature request related to a problem or challenge?
   
   We have recently noticed that `RepartitionExec` is quite expensive in certain 
queries / scenarios:
   - 20-30x slower on certain array types (internally at Datadog)
   - weird behavior in distributed-datafusion on network shuffles depending on 
the number of output tasks 
(https://github.com/datafusion-contrib/datafusion-distributed/issues/385)
   
   It has been difficult to investigate / isolate the cause because the metrics 
provided by the `RepartitionExec` operator lack granularity. As of now, only 
the following are provided:
   - `send_time`: time spent pulling the next batch from input stream (mixed 
spill, channel send, etc.)
   - `repartition_time`: big bucket for repartition work (mixed routing and 
rebuilding batches from routed indices)
   - `fetch_time`: per output partition, covered the whole public batch path
   
   ### Describe the solution you'd like
   
   I would like to introduce more granular metrics that will isolate where 
repartition is spending its time:
   - `fetch_time`: unchanged
   - `repartition_time`: now the end-to-end total repartition time
   - `route_time`: the time to distribute row indices to output partitions
   - `batch_build_time`: the time to build the record batches
   - `channel_wait_time`: per output partition, the time spent waiting for 
channel capacity / `send(...)` to complete
   - `spill_write_time`: per output partition, the time spent writing spilled 
batches
   - `spill_read_wait_time`: per output partition, the time the consumer side 
waits for a spilled batch to become readable
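
To illustrate the intended split between routing and batch building, here is a minimal, self-contained sketch. It does not use DataFusion's metrics API: `RepartitionTimings`, `timed`, `route`, `build_batches`, and `repartition` are hypothetical names, plain vectors stand in for record batches, and a modulo stands in for the real hash function. Only the idea of timing each phase separately comes from the proposal above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-phase timings; field names mirror two of the proposed metrics.
#[derive(Debug, Default)]
struct RepartitionTimings {
    route_time: Duration,
    batch_build_time: Duration,
}

/// Runs `f` once and adds its wall-clock duration to `slot`.
fn timed<T>(slot: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *slot += start.elapsed();
    out
}

/// Phase 1: assign each row index to an output partition.
/// Assumption: `key % num_partitions` stands in for a real hash.
fn route(keys: &[u64], num_partitions: usize) -> Vec<Vec<usize>> {
    assert!(num_partitions > 0, "need at least one output partition");
    let mut indices = vec![Vec::new(); num_partitions];
    for (row, key) in keys.iter().enumerate() {
        indices[(*key as usize) % num_partitions].push(row);
    }
    indices
}

/// Phase 2: materialize one output "batch" per partition from the routed indices.
fn build_batches(values: &[i64], indices: &[Vec<usize>]) -> Vec<Vec<i64>> {
    indices
        .iter()
        .map(|rows| rows.iter().map(|&r| values[r]).collect())
        .collect()
}

/// Repartitions one input batch, timing each phase once per batch (not per row).
fn repartition(
    keys: &[u64],
    values: &[i64],
    num_partitions: usize,
    timings: &mut RepartitionTimings,
) -> Vec<Vec<i64>> {
    let indices = timed(&mut timings.route_time, || route(keys, num_partitions));
    timed(&mut timings.batch_build_time, || build_batches(values, &indices))
}

fn main() {
    let keys = [0u64, 1, 2, 3, 4, 5];
    let values = [10i64, 11, 12, 13, 14, 15];
    let mut timings = RepartitionTimings::default();
    let batches = repartition(&keys, &values, 2, &mut timings);
    println!("{batches:?}");
    println!("{timings:?}");
}
```

In this sketch each timer is read once per phase per batch, so the added cost is a few clock reads per batch rather than per row.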
   
   
   
   ### Describe alternatives you've considered
   
   I have considered other metrics, but I want to keep the hot-path overhead 
of metric collection as small as possible while still gaining good insight 
into the operator.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

