EnricoMi commented on a change in pull request #31905:
URL: https://github.com/apache/spark/pull/31905#discussion_r602504816



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -683,6 +683,12 @@ class SparkSession private(
     ret
   }
 
+  def observation(expr: Column, exprs: Column*): Observation =

Review comment:
       > because the columns are from a certain DataFrame
   
    As long as the column names resolve against a DataFrame's schema, the same
observation can be used with any compatible DataFrame.
   
    ```scala
    Observation(count($"id"), sum($"downloads"))
    ```
   
    With your approach you could reuse the observation with varying columns. If
you see Observation merely as a container for retrieving arbitrary results from
`df.observe`, then your API makes the most sense. If an Observation *is* its
aggregation expressions, which are then applied to multiple compatible
DataFrames, the suggested API is more concise.
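
    To make the contrast concrete, here is a sketch of the two API shapes being
discussed. Note that `Observation(expr, exprs*)` taking expressions, and a
hypothetical `obs.on(df)` method, are assumptions drawn from this thread, not
a confirmed API; only the named-container shape reflects the direction of the
PR under review.

    ```scala
    import org.apache.spark.sql.functions.{count, sum, col}

    // Shape 1 (hypothetical): an Observation *is* its aggregation expressions,
    // so one instance applies to any DataFrame whose schema resolves them:
    //   val obs = Observation(count(col("id")), sum(col("downloads")))
    //   val observed1 = obs.on(df1)   // `on` is an assumed attachment method
    //   val observed2 = obs.on(df2)   // same expressions, different DataFrame

    // Shape 2: an Observation is a named container for results; the
    // expressions are supplied per call, so the same instance can be used
    // with varying columns:
    //   val obs = new Observation("stats")
    //   val observed = df.observe(obs, count(col("id")), sum(col("downloads")))
    ```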




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

