Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/6352#issuecomment-104947339
  
    Thanks for adding that test.  This patch looks like it's in pretty good 
shape, but before we consider merging there are one or two other minor 
corner cases that I'd like to explore.
    
    In the current implementation of `getCacheLocs`, we first check to see 
whether the RDD's cache locations have been previously fetched; if so, we 
return the "cached" set of cache locations, and otherwise we fetch the set of 
locations from the block manager and store it in the cache.  This patch's 
optimization takes place _prior_ to checking / updating the `cacheLocs` map, 
meaning that it might slightly change behavior.  Specifically, I'm wondering 
what happened in the old code if we called `getCacheLocs()` on an RDD that 
wasn't cached, then cached the RDD, forced it to be computed, then called 
`getCacheLocs()` again as part of a different job.  In the old code, an empty 
set of cache locations would have been stored in the map on the first call, so 
I don't think the second call would see an updated set of cache locations 
unless we cleared the cache locations.  In the new code, non-cached RDDs won't 
ever cause entries to be stored in `cacheLocs`, so it's possible that the 
effects of caching might become visible sooner after this patch than they 
would in the old code.  This might be safe, but if we make this change then it 
should be deliberate.  If we want to be really conservative, maybe we should 
move the `storageLevel` check inside the `if (!cacheLocs.contains(rdd.id))` 
block in order to better preserve the old behavior.
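    To illustrate the conservative variant, here's a minimal self-contained sketch 
(not Spark's actual `DAGScheduler` code; `FakeRdd`, `fetchLocsFromBlockManager`, 
and the string-based storage levels are hypothetical stand-ins).  The point is 
that with the `storageLevel` check *inside* the memoization branch, an uncached 
RDD still gets an empty entry recorded in `cacheLocs`, so a later call after 
caching still sees the stale empty entry, matching the old behavior:

```scala
import scala.collection.mutable

// Simplified sketch of the proposed conservative getCacheLocs: the
// storageLevel short-circuit lives inside the memoization branch, so
// non-cached RDDs still cause an (empty) entry to be stored in cacheLocs.
object CacheLocsSketch {
  // Hypothetical stand-in for an RDD; "NONE" means no storage level set.
  case class FakeRdd(id: Int, storageLevel: String)

  private val cacheLocs = mutable.HashMap[Int, Seq[String]]()

  // Stand-in for asking the block manager master for block locations.
  private def fetchLocsFromBlockManager(rdd: FakeRdd): Seq[String] =
    if (rdd.storageLevel == "NONE") Seq.empty else Seq("host-a", "host-b")

  def getCacheLocs(rdd: FakeRdd): Seq[String] = {
    if (!cacheLocs.contains(rdd.id)) {
      // The optimization happens here, *inside* the contains check, so an
      // empty result for an uncached RDD is still memoized in cacheLocs.
      val locs =
        if (rdd.storageLevel == "NONE") Seq.empty[String]
        else fetchLocsFromBlockManager(rdd)
      cacheLocs(rdd.id) = locs
    }
    cacheLocs(rdd.id)
  }
}
```

With this structure, calling `getCacheLocs` on RDD 1 while uncached records an 
empty entry; if the same RDD is later cached and computed, a second call still 
returns the memoized empty seq until `cacheLocs` is cleared, which is exactly 
the old code's behavior that the patch might otherwise change.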

