[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

JoshRosen Sun, 19 Oct 2014 13:52:09 -0700

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2844#discussion_r19063336
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
    @@ -76,23 +87,20 @@ private[spark] class TorrentBroadcast[T: ClassTag](
        * @return number of blocks this broadcast variable is divided into
        */
       private def writeBlocks(): Int = {
    -    // For local mode, just put the object in the BlockManager so we can find it later.
    -    SparkEnv.get.blockManager.putSingle(
    -      broadcastId, _value, StorageLevel.MEMORY_AND_DISK, tellMaster = false)
    -
    -    if (!isLocal) {
    -      val blocks = TorrentBroadcast.blockifyObject(_value)
    -      blocks.zipWithIndex.foreach { case (block, i) =>
    -        SparkEnv.get.blockManager.putBytes(
    -          BroadcastBlockId(id, "piece" + i),
    -          block,
    -          StorageLevel.MEMORY_AND_DISK_SER,
    -          tellMaster = true)
    -      }
    -      blocks.length
    -    } else {
    -      0
    +    // Store a copy of the broadcast variable in the driver so that tasks run on the driver
    +    // do not create a duplicate copy of the broadcast variable's value.
    +    SparkEnv.get.blockManager.putSingle(broadcastId, _value, StorageLevel.MEMORY_AND_DISK,
    +      tellMaster = false)
    --- End diff --
    
    The reason for this store is to avoid creating two copies of `_value` in 
the driver.  If we serialize and deserialize a broadcast variable on the driver 
and then access its value, then without this `putSingle` call we would go 
through the regular de-chunking code path, which deserializes the serialized 
copy of `_value` and leaves a second, redundant copy in driver memory.
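    
    To make the duplicate-copy concern concrete, here is a minimal, 
self-contained sketch.  It is a toy model, not Spark's code: `ToyBlockStore`, 
`ToyBroadcast`, and the `keepLocalCopy` flag are hypothetical names invented 
for illustration, and Java serialization stands in for the real chunking path.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical stand-in for the driver's block store (NOT Spark's BlockManager API).
object ToyBlockStore {
  private val singles = mutable.Map.empty[Long, AnyRef]      // deserialized objects
  private val pieces  = mutable.Map.empty[Long, Array[Byte]] // serialized form

  def putSingle(id: Long, value: AnyRef): Unit = singles(id) = value
  def getSingle(id: Long): Option[AnyRef] = singles.get(id)
  def putBytes(id: Long, bytes: Array[Byte]): Unit = pieces(id) = bytes
  def getBytes(id: Long): Array[Byte] = pieces(id)
}

// Hypothetical model of the write path and of the read path that runs
// after a broadcast handle has been serialized and deserialized.
object ToyBroadcast {
  private def serialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    buffer.toByteArray
  }

  private def deserialize(bytes: Array[Byte]): AnyRef =
    new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject()

  // Always stores the serialized form; optionally also keeps the original object.
  def write(id: Long, value: AnyRef, keepLocalCopy: Boolean): Unit = {
    if (keepLocalCopy) ToyBlockStore.putSingle(id, value)
    ToyBlockStore.putBytes(id, serialize(value))
  }

  // Prefers the locally stored object; otherwise rebuilds the value from bytes,
  // which allocates a second copy alongside the one the caller still holds.
  def read(id: Long): AnyRef =
    ToyBlockStore.getSingle(id).getOrElse {
      val copy = deserialize(ToyBlockStore.getBytes(id))
      ToyBlockStore.putSingle(id, copy)
      copy
    }
}
```

    In this model, `read` after `write(id, value, keepLocalCopy = true)` hands 
back the very same object, while `keepLocalCopy = false` forces a 
deserialization and leaves two equal but distinct copies in memory.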
    
    I believe that this serialization and deserialization can take place when 
tasks are run in local mode, since we still serialize tasks there to help 
users notice serialization issues that would affect them if they moved to a 
cluster.  This complexity is another reason why I'm in favor of scrapping all 
local-mode special-casing and configuring Spark to use a dummy 
LocalBroadcastFactory for local mode instead of whichever setting the user 
specified.  That would be a larger, more invasive change, which is why I opted 
for the simpler fix here.


