Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141117878
The solution with the iterator wrapper was the first approach I prototyped
(http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html).
It's dangerous because there is buffering at many levels, and any combination
of those buffers filling up can leave us in a deadlock:
- NEW: the ForkingIterator's LinkedBlockingDeque
- batching the rows before pickling them
- os buffers on both sides
- pyspark.serializers.BatchedSerializer
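A toy model of those layers shows the danger (the layer names mirror the list above, but the capacities and the forward-on-full behavior are purely illustrative, not Spark's actual buffer sizes): rows can sit stranded in an intermediate buffer while the reader at the far end has seen nothing, so a blocking read there stalls.

```python
def push(layers, row):
    """Append a row to the first layer; each batching layer forwards its
    contents to the next layer only once it is full (illustrative model)."""
    layers[0]["buf"].append(row)
    for here, there in zip(layers, layers[1:]):
        if len(here["buf"]) >= here["cap"]:
            there["buf"].extend(here["buf"])
            here["buf"].clear()

# Hypothetical capacities for the buffering layers between the JVM writer
# and the Python reader.
layers = [
    {"name": "java pickle batch", "cap": 4, "buf": []},
    {"name": "os pipe", "cap": 8, "buf": []},
    {"name": "BatchedSerializer", "cap": 4, "buf": []},
]
for row in range(5):
    push(layers, row)

# Every row written so far is still stuck in some intermediate buffer; the
# reader past the last layer has received nothing, so it would block.
stranded = sum(len(layer["buf"]) for layer in layers)
```

If the writer is simultaneously blocked on a full buffer of its own, neither side can make progress without an explicit flush.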
We can avoid deadlock by being very disciplined. For example, the
ForkingIterator could always check whether the LinkedBlockingDeque is full,
and if so:
Java
- flush the java pickling buffer
- send a flush command to the python process
- os.flush the java side
Python
- flush BatchedSerializer
- os.flush()
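A minimal single-process sketch of that discipline (all names here are hypothetical stand-ins: a `queue.Queue` for the LinkedBlockingDeque, and a counter in place of the actual flush commands sent across the pipe):

```python
import queue

class ForkingIterator:
    """Sketch: fork rows into a bounded buffer, but never block on it.

    When the buffer is full we trigger the flush protocol instead of
    blocking, so rows stuck in intermediate buffers get pushed through.
    """

    def __init__(self, source, capacity=4):
        self.source = source
        # Stand-in for the LinkedBlockingDeque.
        self.buffer = queue.Queue(maxsize=capacity)
        self.flushes = 0

    def _flush_downstream(self):
        # In the real system this would flush the Java pickling buffer,
        # send a flush command to the Python worker (which flushes its
        # BatchedSerializer), and os.flush both sides of the pipe.
        self.flushes += 1
        while not self.buffer.empty():
            self.buffer.get_nowait()

    def __iter__(self):
        for row in self.source:
            if self.buffer.full():
                self._flush_downstream()
            self.buffer.put_nowait(row)
            yield row

it = ForkingIterator(range(10), capacity=4)
out = list(it)
```

The point is only the shape of the check: the producer never issues a blocking put while downstream buffers may be holding the data the consumer is waiting for.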
I'm not sure that the performance regression for a single UDF is going to
hit many people. For one, most upstreams are not a range() call, which never
has to go back to disk and deserialize. My personal opinion is that blocking
performance shouldn't be the reason we reject this approach; the added
complexity should be.
If we want a quick fix that is safe, I would be in favor of passing the
row through, which is indeed slower, but better than deadlocking or computing
the upstream twice. It's just that the current system is unacceptable.
Maybe we should also consider a complete architecture shift: keep the
batching system, but use thrift to serialize the Scala types to a
language-agnostic format and to handle the blocking RPC. Then PySpark and
SparkR could share the same simple UDF architecture. The main drawback is
that I'm not sure how we would support broadcast variables or
aggregators, but should those even be supported with UDFs?
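The batching-RPC idea could look roughly like this sketch, using JSON as a stand-in for thrift (the function names, wire format, and batch shape are all illustrative, not a proposed API):

```python
import json

def encode_batch(rows):
    """Serialize a batch of rows to a language-agnostic wire format
    (JSON here; the proposal above would use thrift)."""
    return json.dumps(rows).encode("utf-8")

def call_udf(batch_bytes, udf):
    """Blocking RPC: decode a batch, apply the UDF, re-encode the results.
    Any language with a thrift binding could implement this same loop."""
    rows = json.loads(batch_bytes.decode("utf-8"))
    return encode_batch([udf(r) for r in rows])

result = json.loads(call_udf(encode_batch([1, 2, 3]), lambda x: x * 2))
# result == [2, 4, 6]
```

Because the whole exchange is one request/response per batch, there is no pipelined buffering to deadlock on, and PySpark and SparkR would only differ in their local implementation of `call_udf`.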