LucaCanali commented on a change in pull request #26953: 
[SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and throughput 
metrics using Spark Metrics system
URL: https://github.com/apache/spark/pull/26953#discussion_r369652547
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
 ##########
 @@ -434,7 +443,10 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 
     override def hasNext: Boolean = nextObj != null || {
       if (!eos) {
+        val startTime = System.nanoTime()
         nextObj = read()
+        val deltaTime = System.nanoTime()-startTime
+        PythonMetrics.incFromWorkerReadTime(deltaTime)
 
 Review comment:
   - I like the idea of adding SQLMetrics to instrument Python UDFs and 
using them in the Web UI. However, I think that work would fit better in a 
separate JIRA/PR. The implementation details and overhead of SQLMetrics 
differ from those of Dropwizard-based metrics, so we would probably want only 
a limited number of SQLMetrics instrumenting task activities in this area. Also, 
implementing SQLMetrics for `[[PythonUDF]]` execution may require some 
significant changes to the current plan evaluation code.
   
   - It is indeed the case that the “read time from worker”, which is exposed to 
users via the Dropwizard library as “FetchResultsTimeFromWorkers”, includes 
socket I/O and deserialization time as well as Python UDF execution time. 
Measuring on the Python side would allow separating the two components; 
however, I currently don’t see how to implement that in a lightweight way. The 
Python profiler can measure on the Python side, as you mentioned, but I see it 
more as a debugging tool, while the proposed instrumentation is lightweight and 
intended for production use as well. Maybe future work can address this if 
there is a need?
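
   To illustrate why the JVM-side timer cannot tell these components apart, 
here is a minimal, self-contained sketch. It is not the PR's code: 
`ReadTimeMetrics` and `TimedReader` are hypothetical names, and an `AtomicLong` 
stands in for the Dropwizard-backed counter. The timer brackets a single 
blocking `read()` call, so whatever happens before that call returns (socket 
I/O, deserialization, and the Python UDF work that produced the data) is 
counted as one number.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for the PR's PythonMetrics object; not Spark's actual API.
object ReadTimeMetrics {
  private val fromWorkerReadTimeNanos = new AtomicLong(0L)

  // Adds the elapsed nanoseconds of one read() call to the running total.
  def incFromWorkerReadTime(deltaNanos: Long): Unit = {
    fromWorkerReadTimeNanos.addAndGet(deltaNanos)
    ()
  }

  def totalNanos: Long = fromWorkerReadTimeNanos.get()
}

// Wraps a blocking read function that returns null at end of stream,
// timing every call in the same way as the hasNext method in the diff above.
final class TimedReader[A >: Null <: AnyRef](read: () => A) extends Iterator[A] {
  private var nextObj: A = null
  private var eos = false

  override def hasNext: Boolean = nextObj != null || {
    if (!eos) {
      val startTime = System.nanoTime()
      // Blocks until the worker has produced and sent the next element.
      nextObj = read()
      ReadTimeMetrics.incFromWorkerReadTime(System.nanoTime() - startTime)
      if (nextObj == null) eos = true
      nextObj != null
    } else {
      false
    }
  }

  override def next(): A = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    val obj = nextObj
    nextObj = null
    obj
  }
}
```

   Splitting the total into UDF time and transfer time would require the 
Python worker to take its own measurements and report them back, which is the 
part I do not see how to do cheaply at the moment.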
   

