Github user koeninger commented on the pull request:
https://github.com/apache/spark/pull/11921#issuecomment-200977714
If all you care about is counts, why wouldn't you at least just write

    inputStream.map(x => 1)
      .window(Seconds(slidingWindowInterval), Seconds(intervalSeconds))
      .foreachRDD(rdd => rdd.count())
to say nothing of reduceByWindow, etc. Similarly, if the first thing you
were doing with your messages was converting them to JSON and projecting a
couple of fields, why not do that before windowing?
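Something like this, as a rough sketch rather than code from this PR (the
stream element type, the extractFields helper, and the projected fields are
all assumed here):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Placeholder for whatever JSON parsing you actually do; the point is it
    // returns only the couple of fields you care about, not the whole message.
    def extractFields(json: String): (String, Long) =
      (json.take(8), json.length.toLong) // made-up projection

    def windowedCounts(inputStream: DStream[(String, String)],
                       slidingWindowInterval: Long,
                       intervalSeconds: Long): Unit = {
      inputStream
        .map { case (_, value) => extractFields(value) } // shrink records first
        .window(Seconds(slidingWindowInterval), Seconds(intervalSeconds))
        .foreachRDD(rdd => println(s"records in window: ${rdd.count()}"))
    }

And for the pure-count case, something like
inputStream.map(_ => 1L).reduceByWindow(_ + _, Seconds(slidingWindowInterval),
Seconds(intervalSeconds)) never keeps the individual records in the window at
all, only the per-batch reduced values.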
Windowing or caching directly over the stream just seems like a bad idea
even with this patch, because you're serializing all kinds of stuff that you
just don't need.
I'd almost rather override persist() to log an error-level message saying
"hey, you probably want to persist or window after your transformation, not
directly on a KafkaRDD"
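For what it's worth, a rough sketch of that kind of override, written as a
standalone pass-through wrapper rather than the actual KafkaRDD code
(WarnOnPersistRDD is a made-up name):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import org.slf4j.LoggerFactory

    // Delegates everything to its parent RDD, but complains when someone
    // persists it directly. The real change would live in KafkaRDD itself.
    class WarnOnPersistRDD[T: ClassTag](parent: RDD[T]) extends RDD[T](parent) {
      @transient private lazy val log = LoggerFactory.getLogger(getClass)

      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        parent.iterator(split, context)

      override protected def getPartitions: Array[Partition] = parent.partitions

      override def persist(newLevel: StorageLevel): this.type = {
        log.error("hey, you probably want to persist or window after your " +
          "transformation, not directly on a KafkaRDD")
        super.persist(newLevel) // still honor the request after warning
      }
    }

Since cache() and the no-argument persist() both funnel through
persist(newLevel), overriding that one method catches all three paths.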