Imran Rashid created SPARK-5467:
-----------------------------------
Summary: DStreams should provide windowing based on timestamps
from the data (as opposed to wall clock time)
Key: SPARK-5467
URL: https://issues.apache.org/jira/browse/SPARK-5467
Project: Spark
Issue Type: New Feature
Components: Streaming
Reporter: Imran Rashid
DStreams currently only let you window based on wall clock time. This doesn't
work very well when you're loading historical logs that are already sitting
around, because they'll all go into one window. DStreams should provide a way
for you to window based on a field of the incoming data. This would be useful
if you want to either (1) bootstrap a streaming app from some logs or (2) test
out the behavior of your app on historical logs, eg. for correctness or
performance.
I think there are some open questions here, such as whether the input data
sources need to be sorted by time, how batches get triggered etc., but it seems
like an important use case.
This just came up on the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html
And I think it is also what was this Jira was getting at:
https://issues.apache.org/jira/browse/SPARK-4427
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]