Github user mengxr commented on the pull request:

    https://github.com/apache/incubator-spark/pull/635#issuecomment-35849312
  
    @markhamstra @pwendell As for use cases, the allCollect operation may be
useful in grid search for a good set of training parameters in machine
learning problems. For example, if the dataset is only 500MB but training takes
half an hour to finish and we have to try 100 different combinations of
training parameters (e.g., rank, regularization constants, and termination
tolerance), that is about 50 hours of training if run one after another. The
wall-clock time can be reduced by distributing a copy of the dataset to
multiple nodes and training in parallel. Another use case is the replicated
join, though locality issues need to be addressed. I agree with you that the
current implementation is inefficient because it puts a heavy load on the
driver.
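
To make the grid-search use case concrete, here is a minimal sketch in plain Python. It does not use Spark or the allCollect operation from this pull request. A thread pool stands in for the worker nodes, each of which is assumed to hold the whole dataset, and `train` and `grid_search` are hypothetical names with a toy scoring rule in place of a real half-hour training run.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train(dataset, rank, reg, tol):
    """Hypothetical stand-in for a real training routine.

    Returns a toy "validation error" so the example is deterministic;
    a real job would fit a model on `dataset` with these parameters.
    """
    return sum(dataset) * reg / rank + tol


def grid_search(dataset, ranks, regs, tols, workers=4):
    """Try every parameter combination in parallel; return the best one.

    Every worker reads the same full `dataset`, which mirrors the idea
    of giving each node its own copy of a small dataset.
    """
    grid = list(product(ranks, regs, tols))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda params: train(dataset, *params), grid))
    # Lowest toy error wins.
    best_score, best_params = min(zip(scores, grid))
    return best_params, best_score


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    params, score = grid_search(data, [5, 10], [0.01, 0.1], [1e-3, 1e-4])
    print(params, score)
```

With 5 ranks, 5 regularization constants, and 4 tolerances, the grid would contain the 100 combinations mentioned above, and the work divides evenly across however many workers hold a copy of the data.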
    
    @coderxiang, could you try to improve the implementation?

