viirya edited a comment on pull request #30392:
URL: https://github.com/apache/spark/pull/30392#issuecomment-729168362


   > I suppose the potential overhead of treeReduce is that it may proceed in 
several phases, whereas reduce does it in one pass. If the executor-side reduce 
is unnecessary because there's one element, by the same token, does it take any 
time? I'd think it just returns the single element.
   
   I think so. In some cases, an unnecessary executor-side reduce might invoke 
an additional map task even though it just returns the single element. So this 
is just a minor concern for me.
   
   The other thing is that `takeOrdered` was introduced into the RDD API 
earlier than `treeReduce`, so I guess using `treeReduce` here was not 
considered before. By its nature, `takeOrdered` collects sequences of elements 
to the driver and reduces them there. That sounds like a good fit for 
`treeReduce`, as we can partially reduce before collecting to the driver. There 
is an overhead, but it sounds like a trade-off. The driver side is usually a 
bottleneck in such cases: the driver-side reduce is not parallel and is also 
bound by local memory for all the data.
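
To make the trade-off concrete, here is a minimal sketch in plain Python (no Spark involved) that only simulates the two strategies. The helper names `flat_take_ordered` and `tree_take_ordered` and the `fan_in` parameter are hypothetical and exist only for this illustration; Spark's actual `treeReduce` picks its grouping from a `depth` argument and the partition count, which this sketch does not reproduce.

```python
import heapq


def top_k(items, k):
    """Per-partition step: the k smallest elements, in ascending order."""
    return heapq.nsmallest(k, items)


def merge_top_k(a, b, k):
    """Merge two ascending lists and keep only the k smallest elements."""
    return heapq.nsmallest(k, heapq.merge(a, b))


def flat_take_ordered(partitions, k):
    """Simulates a plain reduce: every partial list goes to the driver."""
    partials = [top_k(p, k) for p in partitions]
    result = []
    for part in partials:  # sequential merge on the (simulated) driver
        result = merge_top_k(result, part, k)
    return result, len(partials)


def tree_take_ordered(partitions, k, fan_in=2):
    """Simulates a tree-style reduce: partial lists are merged in groups
    (standing in for executor-side tasks) before the driver sees them."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    partials = [top_k(p, k) for p in partitions]
    rounds = 0
    while len(partials) > fan_in:
        merged = []
        for i in range(0, len(partials), fan_in):
            acc = []
            for part in partials[i:i + fan_in]:
                acc = merge_top_k(acc, part, k)
            merged.append(acc)
        partials = merged
        rounds += 1  # each round is extra scheduling overhead
    result = []
    for part in partials:  # final merge on the (simulated) driver
        result = merge_top_k(result, part, k)
    return result, len(partials), rounds


partitions = [[9, 1, 5], [7, 3], [8, 2, 6], [4, 0], []]
flat_result, flat_driver_inputs = flat_take_ordered(partitions, 3)
tree_result, tree_driver_inputs, tree_rounds = tree_take_ordered(partitions, 3)
```

Both strategies return the same elements. In this simulation the tree variant hands the driver fewer partial lists to merge, at the cost of extra intermediate rounds, which is the overhead discussed above.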
   
   I'm totally okay to close this if it doesn't sound like a good direction to 
go.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



