[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

mateiz Fri, 18 Jul 2014 10:16:23 -0700

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1452#discussion_r15122396
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1195,21 +1196,32 @@ abstract class RDD[T: ClassTag](
       /**
        * Return whether this RDD has been checkpointed or not
        */
    -  def isCheckpointed: Boolean = {
    -    checkpointData.map(_.isCheckpointed).getOrElse(false)
    -  }
    +  def isCheckpointed: Boolean = checkpointData.exists(_.isCheckpointed)
     
       /**
        * Gets the name of the file to which this RDD was checkpointed
        */
    -  def getCheckpointFile: Option[String] = {
    -    checkpointData.flatMap(_.getCheckpointFile)
    -  }
    +  def getCheckpointFile: Option[String] = checkpointData.flatMap(_.getCheckpointFile)
     
       // =======================================================================
       // Other internal methods and fields
       // =======================================================================
     
    +  /**
    +   * Broadcasted copy of this RDD, used to dispatch tasks to executors. Note that we broadcast
    +   * the serialized copy of the RDD and for each task we will deserialize it, which means each
    +   * task gets a different copy of the RDD. This provides stronger isolation between tasks that
    +   * might modify state of objects referenced in their closures. This is necessary in Hadoop
    +   * where the JobConf/Configuration object is not thread-safe.
    +   */
    +  @transient private[spark] lazy val broadcasted: Broadcast[Array[Byte]] = {
    +    // TODO: Warn users about very large RDDs.
    --- End diff --
    
    It would be nice to add this warning in this patch; we can just choose a size threshold.
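
    A rough sketch of what such a check could look like. This is only an
    illustration, not code from the patch: the object name, the 1 MB
    threshold, and the use of plain Java serialization (standing in for
    whatever serializer the patch uses) are all assumptions.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical helper, not part of the patch under review.
object BroadcastSizeCheck {
  // Assumed value; the review only asks to "choose a threshold".
  val WarnThresholdBytes: Int = 1 << 20 // 1 MB

  /** Serialize with plain Java serialization (a stand-in for Spark's serializer). */
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  /** Returns a warning message when the serialized form exceeds the threshold. */
  def sizeWarning(bytes: Array[Byte], threshold: Int = WarnThresholdBytes): Option[String] =
    if (bytes.length > threshold)
      Some(s"Serialized RDD is ${bytes.length} bytes, above the ${threshold}-byte threshold")
    else None
}
```

    The caller would log the returned message, if any, just before
    broadcasting the serialized bytes.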


