LucaCanali commented on pull request #31367: URL: https://github.com/apache/spark/pull/31367#issuecomment-769738766
I have updated the PR with the proposed list of metrics. The list currently contains 10 metrics. I can see the case for exposing only a few important metrics. However, the Python UDF implementation is complex, with many moving parts, and metrics also help when troubleshooting corner cases; in those circumstances more metrics mean more flexibility and a better chance of locating the root cause of a problem.

Another point for discussion is how accurate the metrics are in the current implementation. I have run a few tests to check that the measured values make sense and are in the ballpark of what is expected. Measuring execution time in particular can be challenging, since here we attempt to measure it from the JVM side. I have added some hints in the metric descriptions about the nuances I found while testing. Send and receive times between the JVM and the Python workers appear to overlap in some cases, due to the use of queues.

I think "time spent sending data" can be useful when troubleshooting cases where the performance problem comes from sending a lot of data to Python. "Time spent executing" is probably the key metric for understanding overall performance. "Number of rows returned" will likely be another useful one, for tracking the progress of an active query while monitoring it.
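For reference, this is a minimal sketch of the kind of workload I used for the sanity checks above (the UDF name, app name, and data sizes are just illustrative, not part of the PR): a Python UDF with some per-row CPU work over a reasonably large dataset, after which the new metrics can be inspected in the SQL tab of the Spark UI.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("python-udf-metrics-test").getOrCreate()

@udf(returnType=LongType())
def slow_square(x):
    # Do some work per row so that execution time dominates over
    # JVM <-> Python data-transfer time for this test case.
    return sum(i * i for i in range(100)) + x * x

df = spark.range(0, 10_000_000).withColumn("squared", slow_square(col("id")))
# Use the noop data source to force full execution without writing output.
df.write.format("noop").mode("overwrite").save()
```

Varying the row count versus the per-row work makes it possible to check that the send/receive metrics and the execution-time metric move in the expected directions.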
