Github user koeninger commented on the pull request:
https://github.com/apache/spark/pull/11921#issuecomment-200902045
KafkaRDD doesn't have a storage level. If you don't do any caching, and do
multiple actions on a KafkaRDD, it will pull from Kafka each time. This is
exactly the same as any other source where you're pulling from disk, as far as
I know.
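
To make the recompute-per-action point concrete, here is a toy sketch in plain
Python. It does not use Spark or Kafka: LazySource and fetch_from_broker are
hypothetical stand-ins for an uncached RDD and a read from the broker, and the
only thing it tries to show is how many times the source is read.

```python
# Toy model only: LazySource is a hypothetical stand-in, not a Spark class.
class LazySource:
    def __init__(self, fetch):
        self._fetch = fetch            # function that reads from the external system
        self._cache_enabled = False
        self._cached = None

    def cache(self):
        # Mark the data to be kept after the first action computes it.
        self._cache_enabled = True
        return self

    def _compute(self):
        if not self._cache_enabled:
            # No storage level: every action goes back to the source.
            return list(self._fetch())
        if self._cached is None:
            self._cached = list(self._fetch())
        return self._cached

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())


fetches = {"n": 0}

def fetch_from_broker():
    # Hypothetical read; counts how often the source is hit.
    fetches["n"] += 1
    return [b"m0", b"m1", b"m2"]

# Two actions without caching: the source is read once per action.
uncached = LazySource(fetch_from_broker)
uncached.count()
uncached.collect()
uncached_fetches = fetches["n"]

# Two actions with caching: the source is read only for the first action.
fetches["n"] = 0
cached = LazySource(fetch_from_broker).cache()
cached.count()
cached.collect()
cached_fetches = fetches["n"]
```

Comparing uncached_fetches with cached_fetches shows the difference between
the two cases in this model.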
If you are doing caching, I'm not clear on why you would want to cache the
entire byte buffer + 2 timestamps + offset for each message, when you likely
only need a fraction of that for your job.
I'd expect typical jobs to look something like
stream.map(...).filter(...).cache(). If that is pulling in the underlying
message during serialization, rather than your domain object, that's a
problem... but again, without some example code to reproduce it, I'm not clear.
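
A rough sketch of why mapping to a domain object before caching matters, again
in plain Python rather than Spark. Everything here is assumed for
illustration: RawRecord, Click, the field names, the "user_id,rest" payload
layout, and the size estimate (8 bytes per numeric field) are hypothetical and
are not the actual Kafka record or Spark serialization format.

```python
from dataclasses import dataclass

@dataclass
class RawRecord:
    # Hypothetical stand-in for a full message: offset, two timestamps, payload.
    offset: int
    timestamp_a: int
    timestamp_b: int
    payload: bytes

@dataclass
class Click:
    # Hypothetical domain object: only the field the job needs.
    user_id: str

def parse(record):
    # Copy out the needed field; the result keeps no reference to the raw record.
    user_id = record.payload.split(b",", 1)[0].decode("utf-8")
    return Click(user_id)

def raw_size(record):
    # Crude estimate: payload bytes plus three 8-byte numeric fields.
    return len(record.payload) + 3 * 8

def click_size(click):
    # Crude estimate: just the encoded user id.
    return len(click.user_id.encode("utf-8"))

raw = [
    RawRecord(i, 1000 + i, 2000 + i, b"user%d," % i + b"x" * 500)
    for i in range(100)
]

# Equivalent in spirit to map(...).filter(...) before cache().
clicks = [c for c in map(parse, raw) if c.user_id]

raw_bytes = sum(raw_size(r) for r in raw)
domain_bytes = sum(click_size(c) for c in clicks)
```

In this model domain_bytes is a small fraction of raw_bytes, which is the case
for caching the mapped objects rather than the full messages. If the domain
object held on to the raw record instead of copying the field out, that saving
would disappear.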