xkrogen commented on pull request #30725: URL: https://github.com/apache/spark/pull/30725#issuecomment-745394031
Thanks for the further explanation, that is very helpful. It seems like the comment in `HadoopRDD#getJobConf` should be updated, since the concurrency bugs it refers to have been fixed in Hadoop since 2.7.0, a pretty old version.

There is still one point I don't understand. The key for the `JobConf` in the cache is based on the ID of the RDD, not a per-partition key:

```scala
protected val jobConfCacheKey: String = "rdd_%d_job_conf".format(id)
```

So I would expect there to be one cached entry in `hadoopJobMetadata` per RDD. How do we end up with one `JobConf` per partition? Is it because the "check if jobConf is in cache -> if not, put it into cache" steps are not synchronized, so many threads simultaneously decide that the conf isn't present and then put many copies of the conf into the cache? Or have I missed something?

Thanks for bearing with me as I try to understand this issue!
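
For concreteness, here is a minimal sketch of the check-then-put race I have in mind. This is hypothetical illustration code, not the actual `HadoopRDD`/`SparkEnv` implementation: the object name, the use of a plain `ConcurrentHashMap`, and the `newJobConf` parameter are all stand-ins.

```scala
import java.util.concurrent.ConcurrentHashMap

import org.apache.hadoop.mapred.JobConf

// Hypothetical sketch of an unsynchronized check-then-put cache lookup.
// A ConcurrentHashMap stands in for the real metadata cache purely to
// illustrate the race being asked about above.
object JobConfCacheSketch {
  private val hadoopJobMetadata = new ConcurrentHashMap[String, JobConf]()

  def getJobConf(jobConfCacheKey: String, newJobConf: () => JobConf): JobConf = {
    val cached = hadoopJobMetadata.get(jobConfCacheKey)
    if (cached != null) {
      cached
    } else {
      // Several task threads can all observe a miss for the same key here,
      // each build its own JobConf, and each put its own copy. Only the last
      // write survives in the map, but every racing thread holds a reference
      // to its own copy, so transiently there can be roughly one JobConf per
      // concurrently launching task, even though the key is per-RDD.
      val conf = newJobConf()
      hadoopJobMetadata.put(jobConfCacheKey, conf)
      conf
    }
  }
}
```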
