xkrogen commented on pull request #30725:
URL: https://github.com/apache/spark/pull/30725#issuecomment-745394031


   Thanks for the further explanation, that is very helpful. It seems the
   comment in `HadoopRDD#getJobConf` should be updated, since the concurrency
   bugs it references have been fixed in Hadoop since 2.7.0, which is a fairly
   old version by now.
   
   There is still one point I don't understand. It seems that the key for the 
`JobConf` in the cache is based on the ID of the RDD, not a per-partition key:
   ```scala
   protected val jobConfCacheKey: String = "rdd_%d_job_conf".format(id)
   ```
   So I would expect there to be one cached entry in `hadoopJobMetadata` per
   RDD. How do we end up with one `JobConf` per partition? Is it because the
   check-if-in-cache, then put-into-cache-if-absent steps are not synchronized,
   so many task threads simultaneously decide that the conf isn't present and
   each put their own copy into the cache (as in the sketch below)? Or have I
   missed something?
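   
   To make sure I'm describing the same thing, here is a minimal standalone
   sketch of that unsynchronized check-then-put pattern. This is not Spark's
   actual code; the cache, key string, and `getOrCreate` helper are all
   hypothetical stand-ins:
   ```scala
   import java.util.concurrent.ConcurrentHashMap

   // Sketch of an unsynchronized check-then-put cache (hypothetical, not
   // Spark's actual code). Between get() and put() another thread can run
   // the same miss path, so several threads may each build their own value.
   object CheckThenPutRace {
     private val cache = new ConcurrentHashMap[String, AnyRef]()

     def getOrCreate(key: String, create: () => AnyRef): AnyRef = {
       val cached = cache.get(key)
       if (cached != null) {
         cached                // hit: reuse whatever is cached
       } else {
         val fresh = create()  // miss: every racing thread builds a copy
         cache.put(key, fresh) // last writer wins in the cache...
         fresh                 // ...but each caller keeps the copy it built
       }
     }

     def main(args: Array[String]): Unit = {
       val threads = (1 to 8).map { i =>
         new Thread(() => {
           val v = getOrCreate("rdd_0_job_conf", () => new Object())
           println(s"thread $i got object ${System.identityHashCode(v)}")
         })
       }
       threads.foreach(_.start())
       threads.foreach(_.join())
     }
   }
   ```
   Running this typically prints several distinct identity hashes: the cache
   ends up holding only one value, but each racing thread may have built and
   kept its own copy.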
   
   Thanks for bearing with me as I try to understand this issue!

