Github user wesleymiao commented on the pull request:

    https://github.com/apache/spark/pull/5871#issuecomment-100392654
  
    Hi @tdas, will you be able to take a look at this PR? My team at Autodesk is adopting Spark Streaming to run real-time analytics on our services' API logs for various purposes, one of which is monitoring our services' health in real time based on API log statistics. In doing so we ran into a situation where we need to perform reduceByKeyAndWindow on an already windowed DStream.
    
    What I'd like to achieve is a multi-level reduce over the source API logging stream. For example, say the source logging stream arrives at a 1-second interval. The first level of aggregation would be every 1 minute, i.e. 60 intervals of the source stream; the second level every 1 hour; the third level every 1 day; and more levels could be added if we want.
    
    What I hope is that each level does a reduceByKeyAndWindow over the level immediately below it, instead of always aggregating over the source stream: level 3's reduce is based on level 2's results, level 2 on level 1, and level 1 on the source stream, as the sketch below illustrates.
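
    To make this concrete, here is a minimal sketch of the setup I have in mind. The socket source, host/port, and key extraction are just placeholders for our real log ingestion; the chained reduceByKeyAndWindow on an already windowed DStream is exactly the behavior this PR is about:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object MultiLevelWindowing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiLevelWindowing")
    // Source stream arrives at a 1-second batch interval.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: (apiName, 1) pairs parsed from API log lines.
    val logStream = ssc.socketTextStream("localhost", 9999)
      .map(line => (line.split(" ")(0), 1L))

    // Level 1: tumbling 1-minute totals over the 1-second source stream.
    val perMinute = logStream.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(1), Minutes(1))

    // Level 2: 1-hour totals, reduced over level 1's 1-minute results
    // rather than over the raw source stream.
    val perHour = perMinute.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(60), Minutes(60))

    // Level 3: 1-day totals, reduced over level 2's hourly results.
    val perDay = perHour.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(60 * 24), Minutes(60 * 24))

    perDay.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

    Each window and slide duration is a multiple of the previous level's slide, so every level only ever consumes the already-reduced output of the level below it.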
    
    I would expect this approach to be more efficient than always reducing over the source stream, particularly for the higher-level (e.g. daily or weekly) aggregations.

