rangadi commented on code in PR #39931:
URL: https://github.com/apache/spark/pull/39931#discussion_r1120674821
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala:
##########
@@ -96,6 +98,25 @@ trait StateStoreReader extends StatefulOperator {
/** An operator that writes to a StateStore. */
trait StateStoreWriter extends StatefulOperator with PythonSQLMetrics { self: SparkPlan =>
+ /**
+ * Produce the output watermark for given input watermark (ms).
+ *
+ * In most cases, this is same as the criteria of state eviction, as most stateful operators
+ * produce the output from two different kinds:
+ *
+ * 1. without buffering
+ * 2. with buffering (state)
+ *
+ * The state eviction happens when event time exceeds a "certain threshold of timestamp", which
+ * denotes a lower bound of event time values for output (output watermark).
+ *
+ * The default implementation provides the input watermark as it is. Most built-in operators
+ * will evict based on min input watermark and ensure it will be minimum of the event time value
+ * for the output so far (including output from eviction). Operators which behave differently
Review Comment:
If the above is correct, I would like to propose that an explicit contract be
written down in the doc comment for this method:
_An operator guarantees that it will not emit a record with an event timestamp
lower than its output watermark_.
It might be obvious, but I think it is better to state it explicitly. cc:
@jerrypeng
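To make the proposed wording concrete, here is a minimal sketch of the contract as a check over emitted records. It is not Spark code: `Emitted` and `WatermarkContract.violations` are hypothetical names used only for illustration, and "lower than" is read as a strict comparison.

```scala
// Hypothetical model for illustration only; none of these names exist in Spark.
final case class Emitted(key: String, eventTimeMs: Long)

object WatermarkContract {
  /** Returns the emitted records whose event time is strictly below the output watermark. */
  def violations(outputWatermarkMs: Long, emitted: Seq[Emitted]): Seq[Emitted] =
    emitted.filter(_.eventTimeMs < outputWatermarkMs)
}
```

Under this reading, an operator honors the contract when `violations` is empty for everything it emits after reporting that output watermark.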
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala:
##########
@@ -96,6 +98,25 @@ trait StateStoreReader extends StatefulOperator {
/** An operator that writes to a StateStore. */
trait StateStoreWriter extends StatefulOperator with PythonSQLMetrics { self: SparkPlan =>
+ /**
+ * Produce the output watermark for given input watermark (ms).
+ *
+ * In most cases, this is same as the criteria of state eviction, as most stateful operators
+ * produce the output from two different kinds:
+ *
+ * 1. without buffering
+ * 2. with buffering (state)
+ *
+ * The state eviction happens when event time exceeds a "certain threshold of timestamp", which
+ * denotes a lower bound of event time values for output (output watermark).
+ *
+ * The default implementation provides the input watermark as it is. Most built-in operators
+ * will evict based on min input watermark and ensure it will be minimum of the event time value
+ * for the output so far (including output from eviction). Operators which behave differently
Review Comment:
Making sure my understanding is correct. As a concrete example: if a windowed
aggregation (say `count(over 5 minutes)`) emits a record X, its
event timestamp is the window end. With this, the stateful operator for count()
can guarantee that it will not emit any record in the future with a timestamp
lower than its output watermark. Is that correct?
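A toy sketch of that scenario, purely to make the question concrete. It is not Spark code and says nothing definitive about how the built-in operators behave: `ToyWindowedCount` and `Output` are hypothetical names. The sketch assumes tumbling windows, non-negative event times, that inputs older than the current watermark are dropped as late, and that a window is emitted once its end is at or before the watermark.

```scala
// Toy model for illustration only; none of these names are Spark APIs.
final case class Output(windowEndMs: Long, count: Long)

final class ToyWindowedCount(windowMs: Long) {
  // Buffered state: window end -> running count.
  private var state = Map.empty[Long, Long]
  private var watermarkMs = 0L

  // End of the tumbling window containing the event (assumes eventTimeMs >= 0).
  private def windowEnd(eventTimeMs: Long): Long =
    (eventTimeMs / windowMs + 1) * windowMs

  /** Buffers one event; events older than the current watermark are dropped as late. */
  def add(eventTimeMs: Long): Unit =
    if (eventTimeMs >= watermarkMs) {
      val end = windowEnd(eventTimeMs)
      state = state.updated(end, state.getOrElse(end, 0L) + 1L)
    }

  /** Advances the watermark and emits (evicts) every window ending at or before it. */
  def advanceWatermark(newWatermarkMs: Long): Seq[Output] = {
    watermarkMs = math.max(watermarkMs, newWatermarkMs)
    val (closed, open) = state.partition { case (end, _) => end <= watermarkMs }
    state = open
    closed.toSeq.sortBy(_._1).map { case (end, n) => Output(end, n) }
  }
}
```

In this model, once the watermark has advanced to W, every window still in state ends after W, and any newly accepted event falls into a window that also ends after W. Every record emitted later therefore carries a window-end timestamp greater than W, which is the guarantee being asked about.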
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]