Hello Beam community,
I'm looking for a Beam API or implementation pattern for joins where the two
sides have asymmetric arrival times. For example, for the same message, the
"message sent" event arrives at 9am, but the "message read" event may arrive at
11am or even the next day. So when joining the two streams of these events, we
need to keep the buffer/state for "message sent" events long enough to catch
the late/delayed "message read" events.
Currently with Beam, I have seen a few examples of implementing joins with
CoGroupByKey and FixedWindows (a minimal sketch of that pattern is included
after the list below), so I'm considering a few options here:
1. Increase the size of the fixed windows; however, this adds latency to the
application.
2. Use early firings to reduce latency, but then I run into a correctness issue:
   * If I accumulate fired panes, I end up with duplicated results each time an
early firing is triggered.
   * If I discard fired panes, the results miss the late/delayed scenario
described above.
Any comments or suggestions on how to solve this problem? I just found that
Spark Structured Streaming provides what I need, described in this blog post:
https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
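To make the comparison concrete, the approach from that post looks roughly like
the sketch below, adapted to my sent/read example (the column names and the
one-day bound are placeholders, not my real schema):

import static org.apache.spark.sql.functions.expr;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Watermarks tell Spark how long to keep state for each side; the join
// condition bounds how late a "read" may arrive relative to its "sent".
Dataset<Row> sentEvents = sent.withWatermark("sentTime", "2 hours");
Dataset<Row> readEvents = read.withWatermark("readTime", "1 day");

Dataset<Row> joined = sentEvents.join(
    readEvents,
    expr("readMessageId = sentMessageId"
        + " AND readTime >= sentTime"
        + " AND readTime <= sentTime + interval 1 day"));

Is there an equivalent way to express this kind of time-bounded stream-stream
join in Beam?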
Thanks,
Khai