EnricoMi commented on PR #43519: URL: https://github.com/apache/spark/pull/43519#issuecomment-1824030046
This whole problem goes away with unnamed `Observation` instances: ``` >>> observation1 = Observation() >>> observation2 = Observation() ``` What is the purpose of the `name` anyway? It is not used anywhere, nor visible anyway. It is merely used in the query plan to give `CollectMetrics` some id. Historically, the `name` argument was introduced by the `Dataset.observe(str, Column, Column*)` method introduced in 3.0.0, as this `name` was the only identifier required / used to retrieve the metrics from the event listener. Then, in 3.3.0, I have introduced the `Observation` class to simplify retrieval of the metrics from the events, and added the `name` argument as optional, though there is no value in giving your observation a name. I'd propose to remove the optional `name` from Observation in 4.0.0, which fixes [SPARK-45656] by reducing complexity, rather than increasing complexity as per this PR. Arguably, `Dataset.observe(str, Column, Column*)` could be made `private[spark]` in 4.0.0 as well as `Dataset.observe(Observation, Column, Column*)` seems to be more user-friendly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
