Github user markhamstra commented on the pull request:

    https://github.com/apache/incubator-spark/pull/635#issuecomment-35825888
  
    Huh?  I don't get the point of these at all.
    
    At first glance, allCollect looks like a really bad idea.  Collecting the 
entire contents of an RDD to the driver process only to immediately turn around 
and push all of that data (or in this case, multiple copies of the data!) back 
across the network is an anti-pattern and generally a very poor design choice 
that cannot scale to large data -- if you can handle all of the data within the 
driver process, then why are you using a distributed, big-data framework in the 
first place?
    
    allCollectBroadcast makes even less sense to me.  Some workflows do demand 
collecting a relatively small amount of data to the driver and then 
broadcasting a small amount back to the workers for use in further 
computations, but why would I then want to go through the extra step of pushing 
the broadcast values into a strange-looking RDD instead of just using the 
broadcast variable directly?
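
    A minimal sketch of the workflow being endorsed here, using the plain broadcast-variable API directly (Spark 0.9-era); `sc`, `smallRdd`, `bigRdd`, and the lookup logic are hypothetical placeholders, not anything from the PR:

```scala
// Collect a small result to the driver, broadcast it once, and read it
// directly inside tasks -- no wrapper RDD needed.
val small: Array[(String, Int)] = smallRdd.collect()  // small enough for the driver
val bc = sc.broadcast(small.toMap)                    // shipped once to each worker
val result = bigRdd.map { key =>
  bc.value.getOrElse(key, 0)                          // use the broadcast value directly
}
```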
    
    It's going to take a lot of persuading to convince me that either of these 
is something we want to promote and support in the 1.0 API.  That doesn't mean 
that I'm not listening, but I am far from convinced at this point.
