dongjoon-hyun commented on a change in pull request #25856: [SPARK-29182][Core] Cache preferred locations of checkpointed RDD
URL: https://github.com/apache/spark/pull/25856#discussion_r327379987
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
 ##########
 @@ -82,14 +83,28 @@ private[spark] class ReliableCheckpointRDD[T: ClassTag](
     Array.tabulate(inputFiles.length)(i => new CheckpointRDDPartition(i))
   }
 
+  // Cache of preferred locations of checkpointed files.
 +  private[spark] val cachedPreferredLocations: mutable.HashMap[Int, Seq[String]] =
+    mutable.HashMap.empty
 
 Review comment:
   The following assumption sounds weak to me. The HDFS NameNode returns block 
locations based on the current state of the DataNodes, and DataNodes can die at 
any point in time. If a DataNode holding a replica dies, HDFS will re-replicate 
the block and return a different set of locations (possibly including some of 
the existing ones). This PR seems to imply that Spark could end up holding a 
permanently outdated (stale) set of host names. What do you think about that?
   > I think the locations of checkpointed files should not be changed during 
job execution.
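
   To make the concern concrete, here is a minimal sketch (not the PR's actual 
code) of the caching pattern under discussion. The names `LocationCacheSketch`, 
`CachedLocations`, and the `lookup` parameter are hypothetical; `lookup` stands 
in for the HDFS query (e.g. FileSystem.getFileBlockLocations) that resolves a 
partition's checkpoint file to a list of hosts.

       import scala.collection.mutable

       object LocationCacheSketch {
         // Hypothetical sketch: memoize preferred locations per partition index.
         class CachedLocations(lookup: Int => Seq[String]) {

           private val cachedPreferredLocations: mutable.HashMap[Int, Seq[String]] =
             mutable.HashMap.empty

           // The first call per partition queries HDFS; every later call returns
           // the memoized hosts. If a DataNode holding a replica dies after the
           // first call, HDFS re-replicates the block elsewhere, but this cache
           // keeps reporting the old hosts (the staleness raised above).
           def preferredLocations(partitionIndex: Int): Seq[String] =
             cachedPreferredLocations.getOrElseUpdate(partitionIndex, lookup(partitionIndex))
         }

         def main(args: Array[String]): Unit = {
           // A fake lookup that counts how often "HDFS" would actually be hit.
           var hdfsCalls = 0
           val cache = new CachedLocations(i => { hdfsCalls += 1; Seq(s"host-$i") })
           cache.preferredLocations(0)   // queries the lookup; hdfsCalls == 1
           cache.preferredLocations(0)   // served from the cache; hdfsCalls stays 1
           println(s"lookups performed: $hdfsCalls")
         }
       }

   Since preferred locations are scheduling hints, a stale entry should at worst 
cost data locality rather than correctness, but how much that matters depends on 
DataNode churn over the lifetime of the cached RDD.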

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
