GitHub user sahilTakiar opened a pull request: https://github.com/apache/spark/pull/19413

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

    ## What changes were proposed in this pull request?

    Fix for SPARK-20466; the full description of the issue is in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft references, which means the JVM can clear entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug: it first checked whether the cache contained the `JobConf` and, if it did, fetched the `JobConf` from the cache in a separate call and returned it. This doesn't work with soft references, because the JVM can clear the entry between the existence check and the get call, in which case the get returns `null`.

    ## How was this patch tested?

    I haven't thought of a good way to test this yet, given that the issue only occurs intermittently and under high GC pressure. I was thinking of using mocks to verify that `#getJobConf` is doing the right thing.

    I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sahilTakiar/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19413

----
commit 680f32c311e33784e11763c109488d528178efc8
Author: Sahil Takiar <stak...@cloudera.com>
Date:   2017-10-02T20:44:23Z

    [SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

----
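
For reviewers who want to see the check-then-get race in isolation, here is a minimal sketch in plain Java. It is not the Spark code: `SoftCache`, `simulateGc`, and the key `rdd_1_job_conf` are hypothetical names invented for illustration, and a `String` stands in for `JobConf`. The real collector clears soft references at unpredictable times, so the sketch clears the reference by hand to make the interleaving deterministic.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical stand-in for a soft-reference metadata cache; not Spark's actual code. */
final class SoftCache<V> {
    private final ConcurrentMap<String, SoftReference<V>> entries = new ConcurrentHashMap<>();

    public void put(String key, V value) {
        entries.put(key, new SoftReference<>(value));
    }

    /** Single read: returns the cached value, or null if absent or already cleared. */
    public V get(String key) {
        SoftReference<V> ref = entries.get(key);
        return ref == null ? null : ref.get();
    }

    /** Existence check, kept only to demonstrate the racy pattern. */
    public boolean contains(String key) {
        return get(key) != null;
    }

    /** Test hook (an assumption of this sketch) that mimics the GC clearing an entry. */
    public void simulateGc(String key) {
        SoftReference<V> ref = entries.get(key);
        if (ref != null) {
            ref.clear();
        }
    }

    /** Safe pattern: read once, then decide based on the value actually obtained. */
    public V getOrCreate(String key, Supplier<V> factory) {
        V cached = get(key);
        if (cached != null) {
            return cached;
        }
        V fresh = factory.get();
        put(key, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        SoftCache<String> cache = new SoftCache<>();
        cache.put("rdd_1_job_conf", "conf-A");

        // Racy pattern: the check passes, then the entry is cleared before the read.
        boolean present = cache.contains("rdd_1_job_conf");
        cache.simulateGc("rdd_1_job_conf");
        String racy = present ? cache.get("rdd_1_job_conf") : "rebuilt";
        System.out.println("racy read: " + racy); // prints "racy read: null"

        // Safe pattern: a cleared entry is rebuilt instead of being returned as null.
        String safe = cache.getOrCreate("rdd_1_job_conf", () -> "conf-B");
        System.out.println("safe read: " + safe); // prints "safe read: conf-B"
    }
}
```

The point of the sketch is that any decision has to be made on the value returned by a single read. A separate existence check adds nothing when entries can vanish at any time, which is why removing such a method closes off the racy pattern entirely.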