viirya commented on issue #25856: [SPARK-29182][Core] Cache preferred locations of checkpointed RDD
URL: https://github.com/apache/spark/pull/25856#issuecomment-541971274

@squito Thanks for the review!

Because we sample the input RDD before running ALS, the sampled RDD becomes nondeterministic (SPARK-29042). We need to checkpoint it to make it deterministic in case retries happen.

> You could arguably make the same optimization in other places that read from hdfs, eg. HadoopRDD, though I suppose repeated scans of the same dataset are less common in that case?

Yes, I think so. For repeated scans, I think users will use persist, and a persisted dataset does not query block locations.

I also quickly checked HadoopRDD. Its locality info comes from the InputSplit (InputSplitWithLocationInfo). So I guess that for the same HadoopRDD, the InputSplits are reused across repeated scans, and we may not re-query data locality info. (I'm not completely sure; this is just a guess from quickly scanning the related code.)
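
To illustrate the general idea behind caching preferred locations, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how the PR is implemented: `CachedLocations` and `fake_block_lookup` are hypothetical names, and the lookup function is only a stand-in for a file-system block-location query. It assumes the locations of a checkpointed partition do not change between scans.

```python
from typing import Callable, Dict, List


class CachedLocations:
    """Memoize a per-partition location lookup (hypothetical helper, not a Spark API)."""

    def __init__(self, lookup: Callable[[int], List[str]]) -> None:
        self._lookup = lookup  # the expensive call we want to avoid repeating
        self._cache: Dict[int, List[str]] = {}

    def preferred_locations(self, partition: int) -> List[str]:
        # Run the lookup only the first time a partition is requested.
        if partition not in self._cache:
            self._cache[partition] = self._lookup(partition)
        return self._cache[partition]


calls: List[int] = []


def fake_block_lookup(partition: int) -> List[str]:
    # Stand-in for a block-location query; records every invocation.
    calls.append(partition)
    return [f"host-{partition % 3}"]


locs = CachedLocations(fake_block_lookup)
for _ in range(5):  # five repeated scans over the same two partitions
    for p in (0, 1):
        locs.preferred_locations(p)

print(len(calls))  # number of real lookups performed
```

With the cache in place, the repeated scans trigger one lookup per partition instead of one per partition per scan. The trade-off is that a cached answer can go stale if block locations change.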
