GitHub user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2411#issuecomment-55820856
  
    Hey, so I think there are a few issues with this. Given the semantics of persisting RDDs, I don't think it's really possible to express a "hit ratio" that makes sense. If I cache my RDD with MEMORY_AND_DISK and the data is served from disk, is that considered a cache hit? We don't have a binary system of "cached / not cached", so reducing the result to a single ratio doesn't make much sense.
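
    To make the ambiguity concrete, here's a minimal sketch (the input path is hypothetical and `sc` is assumed to be an existing SparkContext):

    ```scala
    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK: partitions that don't fit in memory spill to disk,
    // and memory-resident partitions can be evicted to disk later.
    val rdd = sc.textFile("hdfs:///some/input").persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  // first action materializes the cache

    // On a later action, a partition read back from local disk was
    // persisted but not served from memory -- is that a hit or a miss?
    rdd.count()
    ```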
    
    Another issue with this is that it has somewhat awkward semantics around 
pipelining. For instance:
    
    ```scala
    scala> val x = rdd1.cache()
    scala> x.count

    // The lineage now contains three RDDs (x plus two filters), and only x
    // is cached, so this reports at most a 33% cache ratio even if every
    // partition of x is served from cache.
    scala> x.filter(...).filter(...).count

    // With three filters the lineage has four RDDs, so at most a 25% cache
    // ratio, again even if every partition of x is served from cache.
    scala> x.filter(...).filter(...).filter(...).count
    ```
    
    So instead of this, I'd propose augmenting the existing InputMetrics with a count of the number of partitions coming from each input source. That way we just give the user all the relevant information. I think we almost have this already; we just need to add a partition counter for each input source.
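
    To illustrate the idea, here's a rough sketch; the field names are assumptions for illustration, not the actual InputMetrics API:

    ```scala
    // Hypothetical sketch: report a raw partition count per input source
    // instead of collapsing everything into a single ratio.
    case class InputMetrics(
        bytesRead: Long = 0L,
        partitionsFromMemory: Int = 0,   // served from the memory store
        partitionsFromDisk: Int = 0,     // served from persisted blocks on disk
        partitionsFromHadoop: Int = 0,   // read from the underlying input source
        partitionsRecomputed: Int = 0)   // recomputed from lineage
    ```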

