[ https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liwei Lin updated SPARK-19932:
------------------------------
Description:
{code}
spark
  .readStream // schema: (word, eventTime), like ("a", 10), ("a", 11), ("b", 12) ...
  ...
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("word") // note: "eventTime" is not part of the key columns
  ...
{code}
As shown above, if a watermark is specified for a streaming dropDuplicates
query but the watermark column is not one of the key columns, we still get
the correct answer, but the state keeps growing and is never cleaned up.
The reason is that the watermark attribute is not part of the state store key
in this case, so no event time information is saved in the state store and
there is nothing to compare against the watermark when deciding what to evict.
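To make the state-growth problem concrete, here is a minimal sketch in plain Scala. It is a toy model under stated assumptions, not Spark's StateStore API: a Set or Map stands in for the state store, the watermark advances after every event rather than per batch, and the names DedupStateSketch, Event, dedupKeyOnly and dedupWithEventTime are hypothetical. The eviction rule (drop entries whose stored event time is below the watermark) is one possible policy, not necessarily what the eventual fix does.

```scala
// Toy model only: plain collections stand in for the state store.
// All names here are hypothetical and not part of Spark.
object DedupStateSketch {
  final case class Event(word: String, eventTime: Long)

  // Behaviour described in the issue: the state holds only the key ("word"),
  // so there is no event time to compare with the watermark and nothing
  // can ever be evicted.
  def dedupKeyOnly(events: Seq[Event]): (Vector[Event], Set[String]) =
    events.foldLeft((Vector.empty[Event], Set.empty[String])) {
      case ((out, seen), e) =>
        if (seen.contains(e.word)) (out, seen)
        else (out :+ e, seen + e.word)
    }

  // Assumed alternative: also store the first-seen event time as the value,
  // then evict entries that have fallen behind the watermark.
  def dedupWithEventTime(
      events: Seq[Event],
      delay: Long): (Vector[Event], Map[String, Long]) = {
    val init = (Vector.empty[Event], Map.empty[String, Long], Long.MinValue)
    val (emitted, finalState, _) = events.foldLeft(init) {
      case ((out, st, watermark), e) =>
        // Drop records that are too late or whose key is already in state.
        val (nextOut, nextSt) =
          if (e.eventTime < watermark || st.contains(e.word)) (out, st)
          else (out :+ e, st + (e.word -> e.eventTime))
        // Watermark = max event time seen so far minus the allowed delay.
        val nextWatermark = math.max(watermark, e.eventTime - delay)
        // Keep only entries that are not older than the watermark.
        val kept = nextSt.filter { case (_, t) => t >= nextWatermark }
        (nextOut, kept, nextWatermark)
    }
    (emitted, finalState)
  }
}
```

For the input ("a", 10), ("a", 11), ("b", 12), ("c", 30) with a delay of 10, both functions emit the same three records, but the key-only state ends with three entries while the event-time state ends with only "c". Note that under this policy a record for an evicted word with a newer event time would be emitted again; whether that is acceptable is a design decision this issue does not settle.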
was:
<code>
spark
  .readStream // schema: (word, eventTime), like ("a", 10), ("a", 11), ("b", 12) ...
  ...
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("word") // note: "eventTime" is not part of the key columns
  ...
<code>
As shown above, right now if watermark is specified for a streaming
dropDuplicates query, but not specified as the key columns, then we'll still
get the correct answer, but the state just keeps growing and will never get
cleaned up.
The reason is, the watermark attribute is not part of the key of the state
store in this case. We're not saving event time information in the state store.
> Also save event time into StateStore for certain cases
> ------------------------------------------------------
>
> Key: SPARK-19932
> URL: https://issues.apache.org/jira/browse/SPARK-19932
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Liwei Lin