[jira] [Updated] (SPARK-18791) Stream-Stream Joins

Tathagata Das (JIRA) Mon, 18 Sep 2017 15:04:24 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tathagata Das updated SPARK-18791:
----------------------------------
    Description: 
Stream stream join is a much requested, but missing feature in Structured 
Streaming. While the join API exists in Datasets and DataFrames, it throws 
UnsupportedOperationException when applied between two streaming 
Datasets/DataFrames. To support this, we have to maintain the same semantics as 
other Structured Streaming operations - the result of the operation after 
consuming two data streams data till positions/offsets X and Y, respectively, 
must be the same as a single batch join operation on all the data till 
positions X and Y, respectively. To achieve this, the execution has to buffer 
past data (i.e. streaming state) from each stream, so that future data can be 
matched against past data. Here is the set of a few high-level requirements. 

- Buffer past rows as streaming state (using StateStore), and joining with the 
past rows.
- Support state cleanup using the event time watermark when possible.
- Support different types of joins (inner, left outer, right outer is in 
highest demand for ETL/enrichment type use cases [kafka -> best-effort enrich 
-> write to S3])
- Support cascading join operations (i.e. joining more than 2 streams)
- Support multiple output modes (Append mode is in highest demand for enabling 
ETL/enrichment type use cases)

All the work to incrementally build this is going represented by this JIRA, 
with specific subtasks for each step. At this point, this is the rough 
direction as follows:
- Implement stream-stream inner join in Append Mode, supporting multiple 
cascaded joins.
- Extends it stream-stream left/right outer join in Append Mode



  was:
Stream stream join is a much requested, but missing feature in Structured 
Streaming. While the join API exists in Datasets and DataFrames, it throws 
UnsupportedOperationException when applied between two streaming 
Datasets/DataFrames. To support this, we have to maintain the same semantics as 
other Structured Streaming operations - the result of the operation after 
consuming two data streams data till positions/offsets X and Y, respectively, 
must be the same as a single batch join operation on all the data till 
positions X and Y, respectively. To achieve this, the execution has to buffer 
past data (i.e. streaming state) from each stream, so that future data can be 
matched against past data. Here is the set of a few high-level requirements. 

- Buffer past rows as streaming state (using StateStore), and joining with the 
past rows.
- Support state cleanup using the event time watermark when possible.
- Support different types of joins (inner, left outer, right outer being the 
most common requirements based )
- Support cascading join operations (i.e. joining more than 2 streams)
- Support multiple output modes (Append mode is in highest demand for enabling 
ETL/enrichment type usecases [kafka -> enrich -> write to S3])

All the work to incrementally build this is going represented by this JIRA, 
with specific subtasks for each step. At this point, this is the rough 
direction as follows:
- Implement
This JIRA is to track 




> Stream-Stream Joins
> -------------------
>
>                 Key: SPARK-18791
>                 URL: https://issues.apache.org/jira/browse/SPARK-18791
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Michael Armbrust
>
> Stream stream join is a much requested, but missing feature in Structured 
> Streaming. While the join API exists in Datasets and DataFrames, it throws 
> UnsupportedOperationException when applied between two streaming 
> Datasets/DataFrames. To support this, we have to maintain the same semantics 
> as other Structured Streaming operations - the result of the operation after 
> consuming two data streams data till positions/offsets X and Y, respectively, 
> must be the same as a single batch join operation on all the data till 
> positions X and Y, respectively. To achieve this, the execution has to buffer 
> past data (i.e. streaming state) from each stream, so that future data can be 
> matched against past data. Here is the set of a few high-level requirements. 
> - Buffer past rows as streaming state (using StateStore), and joining with 
> the past rows.
> - Support state cleanup using the event time watermark when possible.
> - Support different types of joins (inner, left outer, right outer is in 
> highest demand for ETL/enrichment type use cases [kafka -> best-effort enrich 
> -> write to S3])
> - Support cascading join operations (i.e. joining more than 2 streams)
> - Support multiple output modes (Append mode is in highest demand for 
> enabling ETL/enrichment type use cases)
> All the work to incrementally build this is going represented by this JIRA, 
> with specific subtasks for each step. At this point, this is the rough 
> direction as follows:
> - Implement stream-stream inner join in Append Mode, supporting multiple 
> cascaded joins.
> - Extends it stream-stream left/right outer join in Append Mode



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-18791) Stream-Stream Joins

Reply via email to