[GitHub] [spark] viirya commented on issue #25856: [SPARK-29182][Core] Cache preferred locations of checkpointed RDD

GitBox Mon, 14 Oct 2019 16:22:34 -0700

viirya commented on issue #25856: [SPARK-29182][Core] Cache preferred locations 
of checkpointed RDD
URL: https://github.com/apache/spark/pull/25856#issuecomment-541971274
 
 
   @squito Thanks for the review!
   
   Because we sample the input RDD before running ALS, the sampled RDD becomes 
nondeterministic (SPARK-29042). We need to checkpoint it to make it 
deterministic in case retries happen.
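   
   A toy illustration of the point above, in plain Python rather than Spark code. It assumes the nondeterminism comes from input row order changing between task attempts while the sampling seed stays fixed; `unordered_input` and `sample` are hypothetical stand-ins, not Spark APIs, and copying a list only models what a checkpoint does.

```python
import random

def unordered_input(attempt):
    """Hypothetical stand-in for a partition whose row order differs per task attempt."""
    rows = list(range(20))
    random.Random(attempt).shuffle(rows)  # order depends on the attempt
    return rows

def sample(rows, fraction, seed):
    """Seeded Bernoulli sampling: the seed fixes which positions are kept, not which rows."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# Same seed, but the retry sees the rows in a different order.
first = sample(unordered_input(attempt=0), 0.5, seed=42)
retry = sample(unordered_input(attempt=1), 0.5, seed=42)

# Modeling a checkpoint: materialize the sample once and reuse the saved
# copy on retry instead of recomputing it from the unordered input.
checkpointed = list(first)
retry_after_checkpoint = list(checkpointed)
```

   In this model the recomputed sample keeps the same positions but generally different rows, while the materialized copy is identical on every read.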
   
   > You could arguably make the same optimization in other places that read 
from hdfs, eg. HadoopRDD, though I suppose repeated scans of the same dataset 
are less common in that case?
   
   Yes, I think so. For repeated scans, I think users will use persist, and a 
persisted dataset will not query block locations.
   
   I also quickly checked HadoopRDD. Its locality info comes from InputSplit 
(InputSplitWithLocationInfo). So I guess that for the same HadoopRDD, the 
InputSplits are reused in repeated scans, and we may not re-query data locality 
info. (Not entirely sure; this is just a guess from quickly scanning the related 
code.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

