GitHub user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2411#issuecomment-55820856
Hey, so I think there are a few issues with this. Given the semantics of
persisting RDDs, I don't think it's really possible to express a "hit ratio"
that makes sense. If I cache my RDD with MEMORY_AND_DISK and the data is
served from disk, is that considered a cache hit? Caching isn't a binary
"cached / not cached" state, so reducing the result to a single ratio doesn't
make much sense.
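To make the ambiguity concrete, here's a minimal sketch (assuming a live
SparkContext `sc`; `rdd1` is just a stand-in dataset):
```
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 1000000)
rdd1.persist(StorageLevel.MEMORY_AND_DISK)
rdd1.count()  // first action materializes the cache; partitions that
              // don't fit in memory are spilled to disk
rdd1.count()  // second action may read some partitions from memory and
              // others from disk -- which of those counts as a "hit"?
```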
Another issue is that a ratio has somewhat awkward semantics around
pipelining. For instance:
```
scala> val x = rdd1.cache()
scala> x.count   // materializes the cache
// At most a 33% cache ratio: the lineage has three RDDs (x plus two
// filtered RDDs) and only x is cached, even if all partitions of x are
// served from cache.
scala> x.filter(...).filter(...).count
// At most a 25% cache ratio (one of four RDDs in the lineage is cached),
// even if all partitions of x are served from cache.
scala> x.filter(...).filter(...).filter(...).count
```
So instead of this, I'd propose augmenting the existing InputMetrics with a
count of the number of partitions coming from each input source. That way we
just give the user all the relevant information. I think we almost have this
already; we just need to add a partition counter for each input source.
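Roughly what I have in mind, as a hypothetical sketch (not the current
InputMetrics API; the `partitionsRead` field is an invented name for
illustration):
```
// DataReadMethod already distinguishes input sources; the proposal is to
// also tally partitions per source, so the UI could report something like
// "7 partitions from memory, 3 recomputed from Hadoop".
object DataReadMethod extends Enumeration {
  val Memory, Disk, Hadoop, Network = Value
}

case class InputMetrics(readMethod: DataReadMethod.Value) {
  var bytesRead: Long = 0L
  var partitionsRead: Int = 0  // proposed counter (hypothetical name):
                               // partitions served from this source
}
```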