dongjoon-hyun commented on a change in pull request #25856: [SPARK-29182][Core]
Cache preferred locations of checkpointed RDD
URL: https://github.com/apache/spark/pull/25856#discussion_r327379987
##########
File path: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
##########
@@ -82,14 +83,28 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
Array.tabulate(inputFiles.length)(i => new CheckpointRDDPartition(i))
}
+  // Cache of preferred locations of checkpointed files.
+  private[spark] val cachedPreferredLocations: mutable.HashMap[Int, Seq[String]] =
+    mutable.HashMap.empty
Review comment:
The following assumption sounds weak to me. The HDFS NameNode returns locations
based on the current state of the data nodes, and data nodes can die at any
time. If a data node holding a replica dies, HDFS re-replicates the block and
then returns a different set of locations (possibly including some of the
existing ones). This PR therefore risks serving a stale set of host names.
> I think the locations of checkpointed files should not be changed during
job execution.
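To illustrate the concern, here is a minimal, hypothetical sketch (the class name, the `lookup` callback, and the `invalidate` method are all my own inventions, not the PR's code): a per-partition cache like the one proposed freezes whatever host list the NameNode returned on the first query, so later re-replication after a node failure is never observed unless the entry is explicitly dropped.

```scala
import scala.collection.mutable

// Hypothetical sketch of the caching pattern under discussion.
// `lookup` stands in for the real NameNode/block-location query.
class LocationCache(lookup: Int => Seq[String]) {
  private val cached = mutable.HashMap.empty[Int, Seq[String]]

  // First call stores the NameNode's answer; subsequent calls return the
  // cached hosts even if replicas have since moved to other data nodes.
  def preferredLocations(split: Int): Seq[String] =
    cached.getOrElseUpdate(split, lookup(split))

  // One possible mitigation: allow stale entries to be dropped so the
  // next access re-queries the file system.
  def invalidate(split: Int): Unit = cached.remove(split)
}
```

Used against a location source that changes underneath it, the cache keeps reporting the pre-failure hosts until invalidated, which is exactly the staleness described above.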
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]