Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11886#discussion_r62184963
  
    --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
    @@ -442,19 +443,19 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
             case None =>
               statuses = mapStatuses.getOrElse(shuffleId, Array[MapStatus]())
               epochGotten = epoch
    +          // If we got here, we failed to find the serialized locations in the cache, so we pulled
    +          // out a snapshot of the locations as "statuses"; let's serialize and return that
    +          byteArr = MapOutputTracker.serializeMapStatuses(statuses)
    +          logInfo("Size of output statuses for shuffle %d is %d bytes"
    +            .format(shuffleId, byteArr.length))
    --- End diff --
    
    This actually is not OK, because the synchronized block is going to block
the dispatcher threads, which could then cause heartbeats and other messages
to not be processed. For small payloads it's fine, but once you get to larger
ones you would have issues.
    
    See PR https://github.com/apache/spark/pull/12113, which fixes issues with
large map output statuses and, I believe, fixes this same issue, because only
one thread will serialize and the rest will use the cached version. Note that
there are still improvements we could make with regards to caching: we could
be more proactive and cache things up front, but since the cached statuses are
all cleared when something changes, you have to know which ones to cache again.
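
    Just to make the "serialize once, reuse the cached bytes" idea concrete, here
is a minimal standalone sketch; this is not the code from PR #12113, and
`SerializedStatusCache`, `getOrCompute`, and `invalidate` are made-up names
(the explicit `java.util.function.Function` keeps it compiling on older Scala):

    ```scala
    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical sketch: cache serialized shuffle output statuses so that only
    // the first requester for a given key pays the serialization cost; concurrent
    // requesters reuse the cached bytes instead of re-serializing while holding a
    // lock on the RPC dispatcher thread.
    class SerializedStatusCache[K](serialize: K => Array[Byte]) {
      private val cache = new ConcurrentHashMap[K, Array[Byte]]()

      // computeIfAbsent runs the serializer at most once per key; other callers
      // asking for the same key wait only for that computation, not a global lock.
      def getOrCompute(key: K): Array[Byte] =
        cache.computeIfAbsent(key, new java.util.function.Function[K, Array[Byte]] {
          override def apply(k: K): Array[Byte] = serialize(k)
        })

      // Cached bytes must be dropped whenever the map outputs for a shuffle
      // change, otherwise readers would see stale locations.
      def invalidate(key: K): Unit = cache.remove(key)
    }
    ```

    In a sketch like this, the serializer passed in would be something along the
lines of `shuffleId => MapOutputTracker.serializeMapStatuses(statuses)`, and
`invalidate` would be called wherever the tracker currently clears its cached
statuses (again, hypothetical wiring, not what the PR actually does).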

