Github user cmccabe commented on the pull request:

    https://github.com/apache/spark/pull/1486#issuecomment-49472027
  
    So, ideally we'd be able to set a different TaskLocality based on whether
the replica is cached or not.  Right now, getPreferredLocations just returns
plain hostname strings, which makes this difficult to do.
    
    It seems like we have a few choices:
    1. simply reorder the replicas, as this change does (the disadvantage is
that we lose some locality information)
    2. change getPreferredLocations to return a type containing
(hostname, Locality), rather than a plain string
    3. keep returning strings from getPreferredLocations, but prefix the
hostnames of cached replicas with "cached:"
    4. add a new function to RDD which would be used, when available, to
return this information
    
    This patch is choice 1.
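
    As a rough, language-neutral illustration of choice 1 (not the code in
this patch), the reordering amounts to a stable sort that moves cached hosts
to the front. The function name and the example hostnames are hypothetical.

```python
def reorder_replicas(hosts, cached_hosts):
    """Return hosts with cached replicas first, keeping relative order."""
    cached = set(cached_hosts)
    # sorted() is stable, and False sorts before True, so cached hosts
    # move to the front while each group keeps its original order.
    return sorted(hosts, key=lambda host: host not in cached)


# Hypothetical example: host3 holds a cached replica.
preferred = reorder_replicas(["host1", "host2", "host3"], ["host3"])
```

    The result is still a plain list of hostnames, so the scheduler can no
longer tell which entries were cached. That is the lost locality information
mentioned in the list of choices.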
    
    Choice 2 might have some backwards compatibility issues.
    
    Choice 3 is a bit ugly, but is clearly the simplest.  Since colons are not 
valid characters in hostnames, it seems safe as well.
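
    A minimal sketch of how choice 3 could encode and decode the marker,
assuming the prefix is exactly "cached:". The helper names are hypothetical
and this is not an existing Spark API.

```python
CACHED_PREFIX = "cached:"  # assumed marker, taken from the proposal above


def encode_location(host, cached):
    """Prefix the hostname when the replica on that host is cached."""
    return CACHED_PREFIX + host if cached else host


def decode_location(location):
    """Split a location string into (hostname, is_cached)."""
    if location.startswith(CACHED_PREFIX):
        return location[len(CACHED_PREFIX):], True
    return location, False
```

    Any consumer that does not know about the prefix would treat
"cached:host1" as an unknown hostname, so every reader of these strings
would need to decode them.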
    
    Choice 4 is a bit trickier: if any code fails to implement the new
function, we fall back to not knowing about cache locality, which isn't ideal.
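
    To make the fallback in choice 4 concrete, here is a small sketch. The
class and method names are hypothetical stand-ins, not Spark's RDD API, and
returning None is just one assumed way to signal "not implemented".

```python
class BaseRDD:
    """Stand-in for an RDD-like class; not Spark's actual RDD."""

    def get_preferred_locations(self, split):
        return []

    def get_cached_locations(self, split):
        # Hypothetical new hook. None means "this RDD does not implement it".
        return None


def locations_with_cache_info(rdd, split):
    """Return (hostname, is_cached) pairs for one split."""
    hosts = rdd.get_preferred_locations(split)
    cached = rdd.get_cached_locations(split)
    if cached is None:
        # Fallback: no cache information, so every host looks uncached.
        return [(host, False) for host in hosts]
    cached = set(cached)
    return [(host, host in cached) for host in hosts]
```

    An RDD that overrides only get_preferred_locations silently takes the
fallback branch, which is the weakness described above.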
    
    Thoughts?

