GitHub user markhamstra commented on the pull request: https://github.com/apache/incubator-spark/pull/635#issuecomment-35825888

Huh? I don't get the point of these at all.

At first glance, allCollect looks like a really bad idea. Collecting the entire contents of an RDD to the driver process only to immediately turn around and push all of that data (or in this case, multiple copies of the data!) back across the network is an anti-pattern and generally a very poor design choice that cannot scale to large data. If you can handle all of the data within the driver process, then why are you using a distributed, big-data framework in the first place?

allCollectBroadcast makes even less sense to me. Some workflows do demand collecting a relatively small amount of data to the driver and then broadcasting a small amount back to the workers for use in further computations, but why would I then want to go through the extra step of pushing the broadcast values into a strange-looking RDD instead of just using the broadcast variable directly?

It's going to take a lot of persuading to convince me that either of these is something we want to promote and support in the 1.0 API. That doesn't mean I'm not listening, but I am far from convinced at this point.
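A minimal sketch of the two patterns being contrasted, for readers following along. This is illustrative only: allCollect is the API proposed in the pull request, not an existing Spark method, so the first half simulates it with collect plus parallelize; the second half shows the standard broadcast-variable usage the comment recommends instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVsAllCollect {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[2]"))

    val rdd = sc.parallelize(1 to 1000000)

    // Anti-pattern (what allCollect amounts to): pull every element to the
    // driver, then ship it all back out as a new RDD -- the full dataset
    // crosses the network twice and must fit in driver memory.
    val everything = rdd.collect()
    val roundTripped = sc.parallelize(everything)

    // Recommended pattern: collect only a small amount of data, broadcast
    // it, and read the broadcast value directly inside closures -- no
    // extra RDD wrapper needed.
    val smallSummary = rdd.take(10)
    val bcast = sc.broadcast(smallSummary.toSet)
    val filtered = rdd.filter(x => bcast.value.contains(x))

    println(filtered.count())
    sc.stop()
  }
}
```

The broadcast value is shipped to each executor once and cached there, which is exactly why wrapping it back into an RDD adds cost without adding capability.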