GitHub user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/126#discussion_r10579167
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -181,15 +178,50 @@ private[spark] class MapOutputTracker(conf: SparkConf) extends Logging {
}
}
+/**
+ * MapOutputTracker for the workers. This uses BoundedHashMap to keep track of
+ * a limited number of most recently used map output information.
+ */
+private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
+
+ /**
+ * Bounded HashMap for storing serialized statuses in the worker. This allows
+ * the HashMap stay bounded in memory-usage. Things dropped from this HashMap will be
+ * automatically repopulated by fetching them again from the driver. Its okay to
+ * keep the cache size small as it unlikely that there will be a very large number of
+ * stages active simultaneously in the worker.
+ */
+ protected val mapStatuses = new BoundedHashMap[Int, Array[MapStatus]](
--- End diff --
Right, what TD is saying is that this particular map in MapOutputTrackerWorker
is concerned with shuffle IDs, not stage IDs. In other words, the driver
doesn't need to communicate stage information to the executors, since the
executors do not maintain any maps that depend on stage IDs, as far as I'm
aware.
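
To make the shuffle-ID point concrete, here is a rough sketch of a bounded,
least-recently-used cache keyed by shuffle ID that refetches dropped entries
on a miss. It is not the BoundedHashMap from this PR: ShuffleStatusCache,
StatusStub and fetchFromDriver are hypothetical names made up for
illustration, and StatusStub only stands in for MapStatus so the snippet
compiles on its own.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's MapStatus, so the sketch is self-contained.
final case class StatusStub(location: String)

// Illustrative bounded cache keyed by shuffle ID (an assumption for this sketch,
// not the PR's BoundedHashMap). Entries evicted here are simply fetched again.
class ShuffleStatusCache(maxEntries: Int, fetchFromDriver: Int => Array[StatusStub]) {
  // LinkedHashMap iterates in insertion order, so the head is the oldest entry.
  private val cache = mutable.LinkedHashMap.empty[Int, Array[StatusStub]]

  def get(shuffleId: Int): Array[StatusStub] = synchronized {
    // On a hit, remove the entry so it can be re-inserted at the tail;
    // on a miss, fall back to the (hypothetical) driver fetch.
    val statuses = cache.remove(shuffleId).getOrElse(fetchFromDriver(shuffleId))
    cache.put(shuffleId, statuses)
    // Evict the least recently used shuffle once the bound is exceeded.
    if (cache.size > maxEntries) cache.remove(cache.head._1)
    statuses
  }

  def contains(shuffleId: Int): Boolean = synchronized(cache.contains(shuffleId))
}
```

Nothing in the lookup path mentions a stage ID: the only key is the shuffle
ID, and a miss only costs another fetch, which is why a small bound is
tolerable in a design like this.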