Github user JoshRosen commented on the pull request:

    https://github.com/apache/incubator-spark/pull/635#issuecomment-35967399
  
    @mengxr For your examples (training multiple models in parallel and 
broadcast join), why is `allCollect` better than directly using broadcast 
variables?  Why do you want to access your model/table through an RDD with a 
single array value?  It seems like you'd need to use something like 
`zipPartitions()` to pair up the model with your data, so why not just directly 
reference the broadcast variable in a `mapPartitions()` on your varying dataset?
    
    In the past, I've thought about implementing a 
`SparkContext.broadcast[T](rdd: RDD[T]): Broadcast[T]` for creating broadcast 
variables from RDDs.  This can be implemented through peer-to-peer broadcasting 
of the  RDD fragments rather than broadcasting the entire collected RDD from 
the driver; this avoids bottlenecking on the driver's send bandwidth.
    
    I tried to prototype efficient broadcasting of RDDs, but ran into some 
difficulties.  My implementation used its own Broadcast subclass that fetched 
RDD partitions from remote block managers when trying to access the broadcast 
variable's value.  The problem is that we need to run some job on the 
broadcasted RDD to generate the partitions that the Broadcast subclass will 
fetch, and we need to ensure that those partitions are computed and stored 
before we attempt to fetch them.  Thus, any transformation that references a 
BroadcastRDD variable must declare a dependency on the stage that produces 
those blocks.  Usually, RDDs declare their dependencies when they're 
constructed; with BroadcastRDDs, we'd need to actually look inside the UDF to 
find any references in order to build the proper lineage.
    
    On a related note, it would be nice to implement an `allReduce` function 
that's equivalent to `reduce().broadcast()` but performs similar optimizations 
to avoid coordinating through the driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to