[ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-18791:
-------------------------------------

    Assignee: Tathagata Das

> Stream-Stream Joins
> -------------------
>
>                 Key: SPARK-18791
>                 URL: https://issues.apache.org/jira/browse/SPARK-18791
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Michael Armbrust
>            Assignee: Tathagata Das
>
> Stream-stream join is a much-requested but missing feature in Structured 
> Streaming. While the join API exists in Datasets and DataFrames, it throws 
> UnsupportedOperationException when applied between two streaming 
> Datasets/DataFrames. To support this, we have to maintain the same semantics 
> as other Structured Streaming operations: the result of the operation after 
> consuming the two data streams up to positions/offsets X and Y, respectively, 
> must be the same as a single batch join operation on all the data up to 
> positions X and Y, respectively. To achieve this, the execution has to buffer 
> past data (i.e. streaming state) from each stream, so that future data can be 
> matched against past data. Here is a set of high-level requirements. 
> - Buffer past rows as streaming state (using StateStore), and join new rows 
> with the past rows.
> - Support state cleanup using the event time watermark when possible.
> - Support different types of joins (inner, left outer, and right outer are in 
> highest demand for ETL/enrichment-type use cases [Kafka -> best-effort enrich 
> -> write to S3])
> - Support cascading join operations (i.e. joining more than 2 streams)
> - Support multiple output modes (Append mode is in highest demand for 
> enabling ETL/enrichment type use cases)
> All the work to incrementally build this is going to be represented by this 
> JIRA, with specific subtasks for each step. At this point, the rough 
> direction is as follows:
> - Implement stream-stream inner join in Append Mode, supporting multiple 
> cascaded joins.
> - Extend it to stream-stream left/right outer joins in Append Mode.
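
The semantics requirement in the description (the incremental result after consuming the streams up to offsets X and Y must equal a batch join over the same data) can be illustrated with a small pure-Python sketch. This is not Spark code: it assumes an inner equi-join on a key, uses plain dicts in place of StateStore, and the names `SymmetricJoin`, `process` and `batch_join` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


class SymmetricJoin:
    """Toy inner equi-join over two streams of (key, value) rows.

    Hypothetical illustration only; plain dicts stand in for streaming state.
    """

    def __init__(self):
        # Buffered past rows per side, indexed by join key.
        self.state = {"left": defaultdict(list), "right": defaultdict(list)}

    def process(self, side, rows):
        """Consume new rows from one side and return the newly joined rows."""
        other = "right" if side == "left" else "left"
        out = []
        for key, value in rows:
            # Match the new row against everything buffered from the other side.
            for other_value in self.state[other][key]:
                pair = (value, other_value) if side == "left" else (other_value, value)
                out.append((key,) + pair)
            # Buffer the row so future rows from the other side can match it.
            self.state[side][key].append(value)
        return out


def batch_join(left, right):
    """Reference result: a single batch inner join over all rows seen so far."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y"), ("b", "z")]

join = SymmetricJoin()
emitted = []
emitted += join.process("left", left[:2])    # left stream consumed up to X = 2
emitted += join.process("right", right[:1])  # right stream consumed up to Y = 1
assert sorted(emitted) == sorted(batch_join(left[:2], right[:1]))

emitted += join.process("right", right[1:])  # Y = 3
emitted += join.process("left", left[2:])    # X = 3
assert sorted(emitted) == sorted(batch_join(left, right))
```

Because each new row is matched only against rows already buffered on the other side, and is buffered before any later row arrives, every matching pair is emitted exactly once. In this sketch the buffers grow without bound, which is why the description also asks for state cleanup.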

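The requirement to clean up state using the event-time watermark can be sketched in the same spirit. The snippet below is a simplified pure-Python illustration, not the Spark implementation. It assumes a single watermark computed as the maximum event time seen minus an allowed lateness, and a join condition that bounds the time difference between matching rows by `max_gap`. The function and parameter names are hypothetical.

```python
def advance_watermark(current, event_times, allowed_lateness):
    """Return the new watermark after seeing a batch of event times.

    The watermark never moves backwards; an empty batch leaves it unchanged.
    """
    if not event_times:
        return current
    candidate = max(event_times) - allowed_lateness
    return candidate if current is None else max(current, candidate)


def evict_expired(buffered, watermark, max_gap):
    """Drop buffered (event_time, row) pairs that can no longer match.

    Assumes future rows on the other side have event_time >= watermark and
    that the join condition requires |t_buffered - t_other| <= max_gap.
    """
    if watermark is None:
        return list(buffered)
    return [(t, row) for t, row in buffered if t >= watermark - max_gap]


# Rows buffered from one side of the join, as (event_time, row) pairs.
buffered = [(10, "a"), (20, "b"), (35, "c")]

# A batch arrives with event times 40 and 38; rows may be up to 5 units late.
watermark = advance_watermark(None, [40, 38], allowed_lateness=5)

# With max_gap = 10, only rows with event_time >= 35 - 10 can still match.
remaining = evict_expired(buffered, watermark, max_gap=10)
```

Without a time bound in the join condition there is no point at which a buffered row is guaranteed to be unmatched by future input, which is one reading of the "when possible" qualifier in the description.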


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

