[GitHub] spark pull request: [SPARK-14105][STREAMING] Deep copy each kafka ...

koeninger Thu, 24 Mar 2016 09:09:01 -0700

Github user koeninger commented on the pull request:

    https://github.com/apache/spark/pull/11921#issuecomment-200902045
  
    KafkaRDD doesn't have a storage level.  If you don't do any caching and run 
multiple actions on a KafkaRDD, it will pull from Kafka each time.  As far as I 
know, this is exactly the same as any other source where you're pulling from disk.
    
    If you are doing caching, I'm not clear on why you would want to cache the 
entire byte buffer + 2 timestamps + offset for each message, when you likely 
only need a fraction of that for your job.
    
    I'd expect typical jobs to look something like 
stream.map(...).filter(...).cache().  If that is pulling in the underlying 
message during serialization, rather than your domain object, that's a 
problem... but again, without some example code to reproduce it, I'm not clear 
that it is.
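
    To make the two points above concrete, here is a toy model in plain Python. 
It is not Spark or Kafka code: RawMessage, CountingSource and LazyDataset are 
made-up names used only for illustration, and the real KafkaRDD may differ in 
detail.  It sketches (1) that an uncached lazy dataset goes back to its source 
on every action, and (2) that caching after map/filter keeps only the small 
derived values, not the full message.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class RawMessage:
    """Hypothetical stand-in for a full Kafka record (not a real Kafka class)."""
    offset: int
    timestamp: int
    payload: bytes


class CountingSource:
    """Hypothetical source that counts how many times it is read."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.fetches = 0

    def read(self) -> List[RawMessage]:
        self.fetches += 1
        return [RawMessage(i, 1000 + i, bytes(1024)) for i in range(self.n)]


class LazyDataset:
    """Toy lazy dataset: recomputes from its parent on every action unless cached."""

    def __init__(self, compute: Callable[[], Iterable]) -> None:
        self._compute = compute
        self._cache_enabled = False
        self._cached: Optional[List] = None

    def map(self, f: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (f(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "LazyDataset":
        # Only marks the dataset; data is materialized on the first action.
        self._cache_enabled = True
        return self

    def cached_items(self) -> Optional[List]:
        return self._cached

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)
        if self._cache_enabled:
            self._cached = list(self._compute())
            return iter(self._cached)
        return iter(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


# (1) No caching: two actions mean two reads from the source.
source = CountingSource(4)
raw = LazyDataset(source.read)
raw.count()
raw.count()
uncached_fetches = source.fetches

# (2) Project to a small value, filter, then cache: one read, and only
# the projected values (here, even offsets) are held in the cache.
source2 = CountingSource(4)
even_offsets = (
    LazyDataset(source2.read)
    .map(lambda m: m.offset)
    .filter(lambda o: o % 2 == 0)
    .cache()
)
even_offsets.count()
even_offsets.count()
cached_fetches = source2.fetches
kept = even_offsets.cached_items()
```

    In this model uncached_fetches ends up as 2 and cached_fetches as 1, and 
the cache holds plain integers rather than RawMessage objects.  Whether a real 
job behaves the same way depends on what the map function actually returns.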


