Github user wesleymiao commented on the pull request:
https://github.com/apache/spark/pull/5871#issuecomment-100392654
Hi @tdas, will you be able to take a look at this PR? My team at Autodesk
is trying to adopt Spark Streaming for real-time analytics on our services'
API logs, for various purposes, among them monitoring our services' health
in real time based on API log statistics. In doing so we ran into a
situation where we need to perform reduceByKeyAndWindow on an already
windowed DStream.
What I'd like to achieve is a multi-level reduce over the source API
logging stream. For example, say the source stream has a batch interval of
1 second. The first level of aggregation would be every 1 minute, i.e. 60
intervals of the source stream. The second level would be every 1 hour, the
third level every 1 day, and we can add more levels if we want. What I hope
is that at each level we can call reduceByKeyAndWindow so that the
aggregation is done over the immediately preceding level instead of always
over the source stream: level 3's reduceByKey is based on level 2's result,
level 2 on level 1, and level 1 on the source stream.
I would expect this approach to be more efficient than always reducing over
the source stream, particularly for the higher-level aggregations (daily
and weekly).