Github user koeninger commented on the pull request:
https://github.com/apache/spark/pull/11921#issuecomment-200977714
If all you care about is counts, why wouldn't you at least just write

    inputStream.map(x => 1)
      .window(Seconds(slidingWindowInterval), Seconds(intervalSeconds))
      .foreachRDD(rdd => rdd.count())
to say nothing of reduceByWindow, etc. Similarly, if the first thing you
were doing with your messages was converting them to JSON and projecting a
couple of fields, why not do that before windowing?
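Something like this, as a rough sketch rather than code from this PR (the
stream element type, the extractFields helper, and the projected fields are
all assumed here):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Placeholder for whatever JSON parsing you actually do; the point is it
    // returns only the couple of fields you care about, not the whole message.
    def extractFields(json: String): (String, Long) =
      (json.take(8), json.length.toLong) // made-up projection

    def windowedCounts(inputStream: DStream[(String, String)],
                       slidingWindowInterval: Long,
                       intervalSeconds: Long): Unit = {
      inputStream
        .map { case (_, value) => extractFields(value) } // shrink records first
        .window(Seconds(slidingWindowInterval), Seconds(intervalSeconds))
        .foreachRDD(rdd => println(s"records in window: ${rdd.count()}"))
    }

And for the pure-count case, something like
inputStream.map(_ => 1L).reduceByWindow(_ + _, Seconds(slidingWindowInterval),
Seconds(intervalSeconds)) never keeps the individual records in the window at
all, only the per-batch reduced values.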
Windowing or caching directly over the stream just seems like a bad idea
even with this patch, because you're serializing all kinds of stuff that you
just don't need.
I'd almost rather override persist() to log an error-level message saying
"hey, you probably want to persist or window after your transformation, not
directly on a KafkaRDD"
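For what it's worth, a rough sketch of that kind of override, written as a
standalone pass-through wrapper rather than the actual KafkaRDD code
(WarnOnPersistRDD is a made-up name):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import org.slf4j.LoggerFactory

    // Delegates everything to its parent RDD, but complains when someone
    // persists it directly. The real change would live in KafkaRDD itself.
    class WarnOnPersistRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent) {
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        parent.iterator(split, context)

      override protected def getPartitions: Array[Partition] = parent.partitions

      override def persist(newLevel: StorageLevel): this.type = {
        log.error("hey, you probably want to persist or window after your " +
          "transformation, not directly on a KafkaRDD")
        super.persist(newLevel) // still honor the request after warning
      }
    }

Since cache() and the no-argument persist() both funnel through
persist(newLevel), overriding that one method catches all three paths.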