HeartSaVioR commented on a change in pull request #26127: [SPARK-29348][SQL] Add observable Metrics for Streaming queries URL: https://github.com/apache/spark/pull/26127#discussion_r361946570
########## File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ########## @@ -1848,6 +1848,54 @@ class Dataset[T] private[sql]( @scala.annotation.varargs def agg(expr: Column, exprs: Column*): DataFrame = groupBy().agg(expr, exprs : _*) + /** + * Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset + * that returns the same result as the input, with the following guarantees: + * - It will compute the defined aggregates (metrics) on all the data that is flowing through the + * Dataset at that point. + * - It will report the value of the defined aggregate columns as soon as we reach a completion + * point. A completion point is either the end of a query (batch mode) or the end of a streaming + * epoch. The value of the aggregates only reflects the data processed since the previous + * completion point. + * Please note that continuous execution is currently not supported. + * + * The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or + * more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that + * contain references to the input Dataset's columns must always be wrapped in an aggregate + * function. + * + * A user can observe these metrics by either adding + * [[org.apache.spark.sql.streaming.StreamingQueryListener]] or a + * [[org.apache.spark.sql.util.QueryExecutionListener]] to the spark session. Review comment: Let's sort it out a bit. Was the batch use case excluded by intention due to the use of DeveloperApi? If that's the intention, then yes #27046 can just fix the slight bug on streaming example. If the concern is focused on having QueryExecution as private API and exposing it to DeveloperApi, I think it's time to fix that, as no one would complain for the change on major release ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
