Github user rxin commented on a diff in the pull request:
https://github.com/apache/spark/pull/15089#discussion_r81233633
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
---
@@ -37,9 +40,25 @@ import org.apache.spark.sql.types.{DataType,
StructField, StructType}
* Python evaluation works by sending the necessary (projected) input data
via a socket to an
* external Python process, and combine the result from the Python process
with the original row.
*
- * For each row we send to Python, we also put it in a queue. For each
output row from Python,
+ * For each row we send to Python, we also put it in a queue first. For
each output row from Python,
* we drain the queue to find the original input row. Note that if the
Python process is way too
- * slow, this could lead to the queue growing unbounded and eventually run
out of memory.
+ * slow, this could lead to the queue growing unbounded and spill into
disk when run out of memory.
+ *
+ * Here is a diagram to show how this works:
+ *
+ * Upstream (from child)
--- End diff --
i suggest switching upstream and downstream and and put child operator
(upstream) at the bottom, and parent operator (downstream) at the top to be
more consistent with databases. basically the "tree" would face upward.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]