hvanhovell commented on a change in pull request #26127: [SPARK-29348][SQL] Add 
observable Metrics for Streaming queries
URL: https://github.com/apache/spark/pull/26127#discussion_r361951217
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -1848,6 +1848,54 @@ class Dataset[T] private[sql](
   @scala.annotation.varargs
   def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*)
 
+ /**
+  * Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset
+  * that returns the same result as the input, with the following guarantees:
+  * - It will compute the defined aggregates (metrics) on all the data that is flowing through the
+  *   Dataset at that point.
+  * - It will report the value of the defined aggregate columns as soon as we reach a completion
+  *   point. A completion point is either the end of a query (batch mode) or the end of a streaming
+  *   epoch. The value of the aggregates only reflects the data processed since the previous
+  *   completion point.
+  * Please note that continuous execution is currently not supported.
+  *
+  * The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or
+  * more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that
+  * contain references to the input Dataset's columns must always be wrapped in an aggregate
+  * function.
+  *
+  * A user can observe these metrics by either adding
+  * [[org.apache.spark.sql.streaming.StreamingQueryListener]] or a
+  * [[org.apache.spark.sql.util.QueryExecutionListener]] to the spark session.
 
 Review comment:
   We shouldn't document unstable features unless you commit to stabilizing 
them in the short term. Please don't add that example. A savvy developer will 
know how to work with this, and can weigh the risk involved themselves.
   
   The same goes for the unstable marker. Please don't mark it as unstable; the 
function itself is stable. It is stable for streaming. The batch case relies on 
`QueryExecutionListener`, which isn't stable, but that API is already marked as 
such, so there shouldn't be any confusion there.
   
   @HeartSaVioR it would be great if we can fix the streaming example.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
