GitHub user sryza commented on the pull request:
https://github.com/apache/spark/pull/3913#issuecomment-73963012
Adding an `estimateSizeOf` method to `SparkContext` sounds reasonable to me.
I agree that there's not a great way to expose something like this for
Python. But I don't think the zaniness of Python-JVM interaction means that we
shouldn't expose useful functionality to pure-JVM apps.
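For concreteness, a rough sketch of what such a method could look like is below. It assumes `org.apache.spark.util.SizeEstimator.estimate` is callable from application code, and the method name and its placement on `SparkContext` are just this proposal, not a committed API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.util.SizeEstimator

// Sketch only: the proposed estimateSizeOf as a thin wrapper over
// SizeEstimator.estimate, which walks the object graph and estimates its
// deserialized, in-memory footprint in bytes.
implicit class SizeEstimationOps(sc: SparkContext) {
  def estimateSizeOf(obj: AnyRef): Long = SizeEstimator.estimate(obj)
}
```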
> For RDD data, it might be slightly misleading here because of things like
> serialization in-memory.
I think this is the kind of thing we can just document. Adding a separate
`estimateSerializedSizeOf` method would be helpful as well.
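A serialized-size variant could be sketched in a similar way, by serializing the object with the application's configured serializer and measuring the resulting buffer. Again, `estimateSerializedSizeOf` is only the name suggested above, not an existing API:

```scala
import org.apache.spark.SparkEnv

// Sketch only: serialize with the configured serializer (Java or Kryo)
// and report the size of the resulting ByteBuffer in bytes.
def estimateSerializedSizeOf(obj: AnyRef): Long = {
  val serializer = SparkEnv.get.serializer.newInstance()
  serializer.serialize(obj).limit().toLong
}
```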
> I'm also not totally sure overall how accurate our memory estimation is
> and it may get less so if we add smarter caching for SchemaRDD's.
I've found it to be very accurate in my experiments. We rely on its
accuracy for shuffle memory management and POJO caching, so to the extent that
it's inaccurate we've got bigger problems.