[
https://issues.apache.org/jira/browse/SPARK-22552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sachin malhotra updated SPARK-22552:
------------------------------------
Description:
When unioning multiple Kafka streams, I found that the resulting DataFrame
contains only the data from the DataFrame that initiated the union, i.e. for
df1.union(df2) (or a chain of unions), the result contains only the rows that
exist in df1.
More specifically, this occurs when data arrives during the same micro-batch
for all three streams. If you wait for each row to be processed for each
stream, the union does return the right results.
For example, if you have 3 Kafka streams and you:
send message 1 to stream 1, WAIT for the batch to finish, send message 2 to
stream 2, wait for the batch to finish, send message 3 to stream 3, wait for
the batch to finish, then the union returns the right data.
But if you:
send messages 1, 2, 3, WAIT for the batch to finish, you receive only the data
from the first stream when unioning all three DataFrames.
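A minimal sketch of the kind of query described above, for anyone trying to
reproduce this. It is not the reporter's actual code: the function name
build_union_query, the topic names, and the broker address are hypothetical,
and it assumes a SparkSession with the spark-sql-kafka connector on the
classpath.

```python
from functools import reduce


def build_union_query(spark, bootstrap_servers, topics):
    """Union one Kafka streaming DataFrame per topic and print rows to the console.

    `spark` is assumed to be an existing SparkSession; `bootstrap_servers`
    and `topics` are placeholders for the real broker address and topic names.
    """
    streams = [
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        # Keep the source topic so each output row shows which stream it came from.
        .selectExpr("topic", "CAST(value AS STRING) AS value")
        for topic in topics
    ]
    # Equivalent to df1.union(df2).union(df3) for three topics.
    unioned = reduce(lambda left, right: left.union(right), streams)
    return unioned.writeStream.format("console").outputMode("append").start()
```

Calling build_union_query(spark, "localhost:9092", ["topic1", "topic2",
"topic3"]) and then producing one message to each topic within the same
trigger interval would correspond to the failing case: going by the report,
only rows from the first topic would be expected to show up on 2.2.0.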
was:
When unioning multiple kafka streams I learned that the resulting dataframe
only contains the data that exists in the dataframe that initiated the union
i.e. if df1.union(df2) (or a chaining of unions) the result will only contain
the rows that exist in df1.
Now to be more specific this occurs when data comes in during the same
micro-batch for all three streams. If you wait for each single row to be
processed for each stream the union does return the right results.
For example, if you have 3 kafka streams and you:
send message 1 to stream 1, WAIT for batch to finish, send message 2 to stream
2,
> Cannot Union multiple kafka streams
> -----------------------------------
>
> Key: SPARK-22552
> URL: https://issues.apache.org/jira/browse/SPARK-22552
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.2.0
> Reporter: sachin malhotra
> Assignee: Shixiong Zhu
> Fix For: 2.3.0, 2.2.2