Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

via GitHub Thu, 23 Nov 2023 01:06:56 -0800


EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1824030046


   This whole problem goes away with unnamed `Observation` instances:
   
   ```
   >>> observation1 = Observation()
   >>> observation2 = Observation()
   ```
   
   What is the purpose of the `name` anyway? It is not used anywhere, nor 
visible anyway. It is merely used in the query plan to give `CollectMetrics` 
some id.
   
   Historically, the `name` argument was introduced by the 
`Dataset.observe(str, Column, Column*)` method introduced in 3.0.0, as this 
`name` was the only identifier required / used to retrieve the metrics from the 
event listener.
   
   Then, in 3.3.0, I have introduced the `Observation` class to simplify 
retrieval of the metrics from the events, and added the `name` argument as 
optional, though there is no value in giving your observation a name.
   
   I'd propose to remove the optional `name` from Observation in 4.0.0, which 
fixes [SPARK-45656] by reducing complexity, rather than increasing complexity 
as per this PR.
   
   Arguably, `Dataset.observe(str, Column, Column*)` could be made 
`private[spark]` in 4.0.0 as well as `Dataset.observe(Observation, Column, 
Column*)` seems to be more user-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Reply via email to