itholic commented on a change in pull request #33559:
URL: https://github.com/apache/spark/pull/33559#discussion_r785663893



##########
File path: docs/web-ui.md
##########
@@ -406,6 +406,8 @@ Here is the list of SQL metrics:
 <tr><td> <code>time to build hash map</code> </td><td> the time spent on 
building hash map </td><td> ShuffledHashJoin </td></tr>
 <tr><td> <code>task commit time</code> </td><td> the time spent on committing 
the output of a task after the writes succeed </td><td> any write operation on 
a file-based table </td></tr>
 <tr><td> <code>job commit time</code> </td><td> the time spent on committing 
the output of a job after the writes succeed </td><td> any write operation on a 
file-based table </td></tr>
+<tr><td> <code>data sent to Python workers</code> </td><td> the number of 
bytes of serialized data sent to the Python workers </td><td> ArrowEvalPython, 
AggregateInPandas, FlaMapGroupsInPandas, FlatMapsCoGroupsInPandas, MapInPandas, 
PythonMapInArrow, WindowsInPandas </td></tr>

Review comment:
       nit: `FlaMapGroupsInPandas` -> `FlatMapGroupsInPandas` ?

##########
File path: docs/web-ui.md
##########
@@ -406,6 +406,8 @@ Here is the list of SQL metrics:
 <tr><td> <code>time to build hash map</code> </td><td> the time spent on 
building hash map </td><td> ShuffledHashJoin </td></tr>
 <tr><td> <code>task commit time</code> </td><td> the time spent on committing 
the output of a task after the writes succeed </td><td> any write operation on 
a file-based table </td></tr>
 <tr><td> <code>job commit time</code> </td><td> the time spent on committing 
the output of a job after the writes succeed </td><td> any write operation on a 
file-based table </td></tr>
+<tr><td> <code>data sent to Python workers</code> </td><td> the number of 
bytes of serialized data sent to the Python workers </td><td> ArrowEvalPython, 
AggregateInPandas, FlaMapGroupsInPandas, FlatMapsCoGroupsInPandas, MapInPandas, 
PythonMapInArrow, WindowsInPandas </td></tr>
+<tr><td> <code>data returned from Python workers</code> </td><td> the number 
of bytes of serialized data received back from the Python workers </td><td> 
ArrowEvalPython, AggregateInPandas, FlaMapGroupsInPandas, 
FlatMapsCoGroupsInPandas, MapInPandas, PythonMapInArrow, WindowsInPandas 
</td></tr>

Review comment:
       ditto ?

##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala
##########
@@ -42,7 +43,10 @@ class ArrowPythonRunner(
     argOffsets: Array[Array[Int]],
     schema: StructType,
     timeZoneId: String,
-    conf: Map[String, String])
+    conf: Map[String, String],
+    pythonDataSent: SQLMetric,
+    val pythonDataReceived: SQLMetric,
+    val pythonNumRowsReceived: SQLMetric)

Review comment:
       qq: Why are `pythonDataReceived` and `pythonNumRowsReceived` defined as `val`s here?
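
       For context on what the question is about — a minimal Scala sketch (hypothetical `Runner` class, not the actual `ArrowPythonRunner`) of the difference `val` makes on a constructor parameter:

       ```scala
       // A plain constructor parameter is only in scope inside the class body;
       // adding `val` also promotes it to a public member visible to callers.
       class Runner(
           dataSent: Long,          // plain parameter: internal use only
           val dataReceived: Long)  // `val`: exposed as r.dataReceived
       {
         def bytesSent: Long = dataSent  // parameters are visible internally
       }

       object Demo {
         def main(args: Array[String]): Unit = {
           val r = new Runner(10L, 20L)
           println(r.dataReceived)  // compiles: `dataReceived` is a member
           // println(r.dataSent)   // would NOT compile: not a member
         }
       }
       ```

       So the `val`s would only be needed if some caller reads those metrics off the runner instance.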

##########
File path: docs/web-ui.md
##########
@@ -406,6 +406,8 @@ Here is the list of SQL metrics:
 <tr><td> <code>time to build hash map</code> </td><td> the time spent on 
building hash map </td><td> ShuffledHashJoin </td></tr>
 <tr><td> <code>task commit time</code> </td><td> the time spent on committing 
the output of a task after the writes succeed </td><td> any write operation on 
a file-based table </td></tr>
 <tr><td> <code>job commit time</code> </td><td> the time spent on committing 
the output of a job after the writes succeed </td><td> any write operation on a 
file-based table </td></tr>
+<tr><td> <code>data sent to Python workers</code> </td><td> the number of 
bytes of serialized data sent to the Python workers </td><td> ArrowEvalPython, 
AggregateInPandas, FlaMapGroupsInPandas, FlatMapsCoGroupsInPandas, MapInPandas, 
PythonMapInArrow, WindowsInPandas </td></tr>
+<tr><td> <code>data returned from Python workers</code> </td><td> the number 
of bytes of serialized data received back from the Python workers </td><td> 
ArrowEvalPython, AggregateInPandas, FlaMapGroupsInPandas, 
FlatMapsCoGroupsInPandas, MapInPandas, PythonMapInArrow, WindowsInPandas 
</td></tr>

Review comment:
       And maybe we should also document the `number of output rows` metric here?
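
       If so, a row in the table's existing style might look roughly like this (the description and operator column are placeholders to be confirmed against the actual metric):

       ```html
       <tr><td> <code>number of output rows</code> </td><td> the number of
       output rows </td><td> the relevant Python operators (same list as
       above) </td></tr>
       ```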




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


