HyukjinKwon edited a comment on issue #23263: [SPARK-23674][ML] Adds Spark ML 
Events
URL: https://github.com/apache/spark/pull/23263#issuecomment-445626477
 
 
   > Visualizing a workflow is nice, but Spark's Pipelines are typically pretty 
straightforward and linear. I could imagine producing a nicer visualization 
than what you get from reading the Spark UI, although of course we already have 
some degree of history and data there.
   
   Another point worth considering is that these events can be combined with 
other SQL events, for instance to show where the input `Dataset` originated. 
With current Apache Spark, I can already visualise SQL operations as below:
   
   ![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png)
   
   I think we can combine those existing lineages to easily understand where 
the data comes from and where it goes.
   
   > These are just the hooks, right? someone would have to implement something 
to use these events. I see the value in the API to some degree, but with no 
concrete implementation, does it add anything for Spark users out of the box?
   
   Yes, right. It does not add anything for Spark users out of the box. It 
needs a custom listener implementation. For instance, with the custom listener 
below:
   
   ```scala
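   import org.apache.spark.ml.MLEvent
   import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

   class CustomSparkListener extends SparkListener {
     // ML events have no dedicated callback, so they arrive via onOtherEvent.
     override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
       case e: MLEvent => // do something with the ML event
       case _ => // ignore all other events
     }
   }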
   ```
   
   There are two (existing) ways to use this.
   
   ```scala
   spark.sparkContext.addSparkListener(new CustomSparkListener)
   ```
   
   ```bash
   spark-submit ...\
     --conf spark.extraListeners=CustomSparkListener\
     ...
   ```
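   
   The same `spark.extraListeners` setting can also be supplied when the 
session is built. Below is a minimal sketch, not something this PR adds: 
`CountingListener` and `ExtraListenersDemo` are hypothetical names, and it 
assumes the listener has a zero-argument constructor so that Spark can 
instantiate it by class name.
   
   ```scala
   import java.util.concurrent.atomic.AtomicLong

   import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
   import org.apache.spark.sql.SparkSession

   // Hypothetical listener: counts every event delivered to onOtherEvent.
   class CountingListener extends SparkListener {
     override def onOtherEvent(event: SparkListenerEvent): Unit =
       CountingListener.seen.incrementAndGet()
   }

   object CountingListener {
     // Shared counter, because Spark creates the listener instance itself.
     val seen = new AtomicLong(0)
   }

   object ExtraListenersDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("extra-listeners-demo")
         // spark.extraListeners takes fully qualified class names.
         .config("spark.extraListeners", classOf[CountingListener].getName)
         .getOrCreate()
       try {
         spark.range(10).count() // any job will do for the demo
       } finally {
         spark.stop()
       }
       println(s"events seen by onOtherEvent: ${CountingListener.seen.get()}")
     }
   }
   ```
   
   The setting expects a fully qualified class name, which is why the sketch 
uses `classOf[...].getName` rather than a bare string.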
   
   This is also similar to other existing implementations on the SQL side (the 
catalog events described above in the PR description).
   
   One actual example I had with a SQL query listener was that I had to close 
a connection after every SQL execution.
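   
   A rough sketch of that kind of listener could look like the following. 
`ClosingQueryListener` and the `closeConnection` callback are hypothetical 
names used only for illustration; `QueryExecutionListener` itself is the 
existing Spark SQL interface.
   
   ```scala
   import org.apache.spark.sql.execution.QueryExecution
   import org.apache.spark.sql.util.QueryExecutionListener

   // Hypothetical listener: runs a caller-supplied cleanup after every query,
   // whether the query succeeded or failed.
   class ClosingQueryListener(closeConnection: () => Unit)
       extends QueryExecutionListener {

     override def onSuccess(
         funcName: String, qe: QueryExecution, durationNs: Long): Unit =
       closeConnection()

     override def onFailure(
         funcName: String, qe: QueryExecution, exception: Exception): Unit =
       closeConnection()
   }
   ```
   
   It would be registered with 
`spark.listenerManager.register(new ClosingQueryListener(() => conn.close()))`, 
where `conn` stands for whatever connection the application holds.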
   
   > Is that what someone would likely do? or would someone likely have to run 
Atlas to use this? If that's a good example of the use case, and Atlas is 
really about lineage and governance, is that the thrust of this change, to help 
with something to do with model lineage and reproducibility?
   
   There are two reasons.
   
   1. I think people in general would likely use this feature like other event 
listeners. At least, I can see some interest outside the project:
   
       - SQL Listener
         - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query
         - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html
   
       - Streaming Query Listener
         - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/
         - http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416
   
   2. Someone would likely run this via Atlas. The plugin mirror is 
intentionally exposed at 
[spark-atlas-connector](https://github.com/hortonworks-spark/spark-atlas-connector) 
so that anyone can build on it for lineage and governance in Atlas, as you 
said. Yes, I'm trying to show integrated lineages in Apache Spark, and this is 
the missing piece. This change is needed for that.
   
   (BTW, I hope it doesn't sound like I'm pushing this for my specific case. I 
know one explicit case that I have been working on, and I'm trying to show it 
transparently for a better understanding of what this PR does and proposes, 
rather than hiding what I do to make it sound like it only targets general 
cases. It essentially targets not only my case but also all other use cases.)
   
   Also, I updated the PR description based on these questions and answers.
