GitHub user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35849312

@markhamstra @pwendell As for use cases, this allCollect operation may be useful in a grid search for a good set of training parameters in machine learning problems. For example, if the dataset is only 500MB but training takes half an hour and we have to try 100 different combinations of training parameters (e.g., rank, regularization constants, and termination tolerance), the wall-clock time can be reduced by distributing the dataset to multiple nodes and training the combinations in parallel. Another use case is the replicated join, though locality issues would need to be addressed first.

I agree with you that the current implementation is inefficient: it puts a heavy load on the driver. @coderxiang, could you try to improve the implementation?
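
To make the grid-search use case concrete, here is a minimal plain-Python sketch of the pattern. It does not use Spark or allCollect: the thread pool only stands in for cluster nodes, and the toy dataset, the `train` function, and the parameter values are all hypothetical, chosen purely for illustration. The point is that every worker sees the whole (small) dataset and differs only in the parameter combination it trains with.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy stand-in for the small dataset: points on the line y = 2x.
DATASET = [(float(x), 2.0 * x) for x in range(1, 6)]

def train(data, reg, tol, lr=0.01, max_iter=10_000):
    """Fit y = w*x by gradient descent on the L2-regularized MSE; return (w, mse)."""
    n = len(data)
    w = 0.0
    for _ in range(max_iter):
        grad = 2.0 * sum(x * (w * x - y) for x, y in data) / n + 2.0 * reg * w
        if abs(grad) < tol:  # termination tolerance reached
            break
        w -= lr * grad
    mse = sum((w * x - y) ** 2 for x, y in data) / n
    return w, mse

def grid_search(data, regs, tols, max_workers=4):
    """Train once per (reg, tol) pair concurrently; return results sorted by error."""
    grid = list(product(regs, tols))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Every task receives the full dataset; only the parameters differ.
        results = list(pool.map(lambda p: (p, *train(data, *p)), grid))
    return sorted(results, key=lambda r: r[2])

results = grid_search(DATASET, regs=[0.0, 0.1, 1.0], tols=[1e-2, 1e-6])
best_params, best_w, best_mse = results[0]
print(best_params, best_w, best_mse)
```

In a real cluster setting each task would run on a separate node holding its own copy of the data, which is where an operation like the proposed allCollect would come in; Python threads here only demonstrate the control flow and give no real speedup for CPU-bound training.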