[ 
https://issues.apache.org/jira/browse/SPARK-22552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sachin malhotra updated SPARK-22552:
------------------------------------
    Description: 
When unioning multiple Kafka streams, I found that the resulting DataFrame 
only contains the data from the DataFrame that initiated the union, 
i.e. for df1.union(df2) (or a chain of unions) the result only contains 
the rows that exist in df1.

More specifically, this occurs when data arrives during the same 
micro-batch for all three streams. If you wait for each single row to be 
processed for each stream, the union does return the right results. 

For example, if you have 3 Kafka streams and you:

send message 1 to stream 1, WAIT for the batch to finish, send message 2 to 
stream 2, wait for the batch to finish, send message 3 to stream 3, and wait 
for the batch to finish, the union returns the right data.

But if you:

send messages 1, 2 and 3 to their streams and only then WAIT for the batch to 
finish, you only receive the data from the first stream when unioning all 
three DataFrames.
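
A minimal PySpark sketch of the setup described above may help when trying to 
reproduce the two scenarios. It is not the reporter's code: the topic names, 
the broker address and the helper names (kafka_stream, union_all, 
start_union_query) are assumptions made for illustration, and the console 
sink is just one convenient way to inspect each micro-batch.

```python
from functools import reduce

# Hypothetical values: the report gives neither topic names nor a broker address.
TOPICS = ["stream1", "stream2", "stream3"]
BOOTSTRAP_SERVERS = "localhost:9092"


def kafka_stream(spark, topic, servers=BOOTSTRAP_SERVERS):
    """Build one streaming DataFrame that reads a single Kafka topic."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
        # Keep the source topic so each output row can be traced to its stream.
        .selectExpr("topic", "CAST(value AS STRING) AS value")
    )


def union_all(frames):
    """Chain union left to right: frames[0].union(frames[1]).union(frames[2])..."""
    return reduce(lambda left, right: left.union(right), frames)


def start_union_query(spark, topics=TOPICS):
    """Union one stream per topic and print every micro-batch to the console."""
    unioned = union_all([kafka_stream(spark, topic) for topic in topics])
    return (
        unioned.writeStream.format("console")
        .outputMode("append")
        .option("truncate", "false")
        .start()
    )
```

Assuming a SparkSession started with the Kafka source package on the 
classpath, calling start_union_query(spark) and then sending one message to 
each topic within the same micro-batch should exercise the failing case; 
sending them one batch at a time should exercise the working one.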


  was:
When unioning multiple kafka streams I learned that the resulting dataframe 
only contains the data that exists in the dataframe that initiated the union 
i.e. if df1.union(df2) (or a chaining of unions) the result will only contain 
the rows that exist in df1.

Now to be more specific this occurs when data comes in during the same 
micro-batch for all three streams. If you wait for each single row to be 
processed for each stream the union does return the right results. 

For example, if you have 3 kafka streams and you:

send message 1 to stream 1, WAIT for batch to finish, send message 2 to stream 
2, 


> Cannot Union multiple kafka streams
> -----------------------------------
>
>                 Key: SPARK-22552
>                 URL: https://issues.apache.org/jira/browse/SPARK-22552
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: sachin malhotra
>            Assignee: Shixiong Zhu
>             Fix For: 2.3.0, 2.2.2
>


