viirya edited a comment on pull request #30392: URL: https://github.com/apache/spark/pull/30392#issuecomment-729168362
> I suppose the potential overhead of treeReduce is that it may proceed in several phases, whereas reduce does it in one pass. If the executor-side reduce is unnecessary because there's one element, by the same token, does it take any time? I'd think it just returns the single element. I think so. In some cases, unnecessary executor-side reduce might invoke an additional map task although it just returns the single element. So this is just a minor concern for me. The other thing is, `takeOrdered` is introduced into RDD API earlier than `treeReduce`. So I guess we may not consider to use `treeReduce` in this place before. For the nature of `takeOrdered` that sequences of elements are collected to the driver and reduced. It sounds a good fit for `treeReduce` as we can partially reduce before collecting to the driver. There is an overhead but sounds like a trade-off. Driver side usually a bottleneck for such case. The driver-side reduce is not parallel and also bound to local memory for all data. I'm totally okay to close this if it doesn't sound good direction to go. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
