LucaCanali commented on a change in pull request #26953:
[SPARK-30306][CORE][PYTHON] Instrument Python UDF execution time and throughput
metrics using Spark Metrics system
URL: https://github.com/apache/spark/pull/26953#discussion_r369652547
##########
File path: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
##########
@@ -434,7 +443,10 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
override def hasNext: Boolean = nextObj != null || {
if (!eos) {
+ val startTime = System.nanoTime()
nextObj = read()
+ val deltaTime = System.nanoTime()-startTime
+ PythonMetrics.incFromWorkerReadTime(deltaTime)
Review comment:
- I like the idea of adding SQLMetrics for Python UDF instrumentation and
using them in the Web UI. However, I think that work would fit better in a
separate JIRA/PR. The implementation details and the overhead of SQLMetrics
differ from those of Dropwizard-based metrics, so we would probably want only
a limited number of SQLMetrics instrumenting task activities in this area. Also,
implementing SQLMetrics for `[[PythonUDF]]` execution may require
significant changes to the current plan evaluation code.
- It is indeed the case that the “read time from worker”, which is exposed to
users via the Dropwizard library as “FetchResultsTimeFromWorkers”, includes
socket I/O and deserialization time as well as Python UDF execution time.
Measuring on the Python side would allow separating the two components, but
I currently don’t see how to implement that in a lightweight way. The Python
profiler can measure on the Python side, as you mentioned, but I see it more
as a debugging tool, while the proposed instrumentation is lightweight and
intended for production use as well. Maybe future work can address this case
if there is a need?
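
To illustrate the second point, here is a minimal, self-contained sketch of the
timing pattern from the hunk above. `ReadTimeMetrics` and `TimedReader` are
hypothetical names used only for illustration (they are not part of Spark or of
this PR), and a plain `LongAdder` stands in for the Dropwizard counter. The
point is that the JVM side only sees how long the blocking read took, so worker
compute time and I/O time end up in the same number.

```scala
import java.util.concurrent.atomic.LongAdder

// Hypothetical stand-in for the PR's PythonMetrics object; not Spark's API.
object ReadTimeMetrics {
  private val fromWorkerReadNanos = new LongAdder
  def incFromWorkerReadTime(nanos: Long): Unit = fromWorkerReadNanos.add(nanos)
  def fromWorkerReadTime: Long = fromWorkerReadNanos.sum()
}

// Wraps a blocking read function and times every call, as in the hunk above.
// Assumption: `read` returns null at end of stream.
class TimedReader[A <: AnyRef](read: () => A) extends Iterator[A] {
  private var nextObj: A = null.asInstanceOf[A]
  private var eos = false

  override def hasNext: Boolean = nextObj != null || {
    if (!eos) {
      val startTime = System.nanoTime()
      // Blocks until the worker has computed AND sent the next result,
      // so the measured interval covers both components.
      nextObj = read()
      ReadTimeMetrics.incFromWorkerReadTime(System.nanoTime() - startTime)
      if (nextObj == null) eos = true
      nextObj != null
    } else {
      false
    }
  }

  override def next(): A = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    val obj = nextObj
    nextObj = null.asInstanceOf[A]
    obj
  }
}

object TimedReaderDemo {
  def run(): List[String] = {
    val source = Iterator("a", "b")
    // Thread.sleep simulates UDF execution time on the worker side.
    val reader = new TimedReader[String](() => {
      Thread.sleep(2)
      if (source.hasNext) source.next() else null
    })
    reader.toList
  }
}
```

Calling `TimedReaderDemo.run()` drains the reader, after which
`ReadTimeMetrics.fromWorkerReadTime` holds the accumulated nanoseconds,
including the simulated compute time. Splitting the two components would need
a timestamp taken on the worker side, which this sketch does not attempt.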