Github user davies commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-140925636
@justinuang This patch works pretty well with multiple UDFs, but I have two
concerns before reviewing the details: 1) it adds some overhead for each batch,
which causes a regression for a single UDF; 2) it makes lots of changes in
PythonRDD and worker.py, so Python UDFs work differently from other PythonRDDs,
which increases the complexity (and the potential for bugs).
There is another approach, as we discussed before: use a better cache for the
upstream RDD. It could be called CachedOnceRDD. It appends all the rows to an
array when compute() is called for the first time, then pulls and removes the
rows when compute() is called the second time. This CachedOnceRDD should work
similarly to UnsafeExternalSorter (spilling to disk if there is not enough
memory). I think this approach should perform better than the current one,
without changing PythonRDD (which is already very complicated). I really want
to try this out, but have not had time to work on it.
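
To make the idea concrete, here is a rough, in-memory sketch of the cache-once behavior for a single partition. It is only an illustration: `CachedOnceBuffer` and `compute_upstream` are hypothetical names, not Spark APIs, it is plain Python rather than an actual RDD, and it leaves out the spilling to disk that a real version would need.

```python
from collections import deque


class CachedOnceBuffer:
    """Hypothetical stand-in for one partition of the proposed CachedOnceRDD.

    Assumption: compute_upstream is a zero-argument callable returning an
    iterator over the upstream rows. No spilling to disk is implemented.
    """

    def __init__(self, compute_upstream):
        self._compute_upstream = compute_upstream
        self._rows = deque()
        self._filled = False

    def compute(self):
        # First call: run the upstream once and remember every row.
        if not self._filled:
            self._filled = True
            return self._fill()
        # Later call: replay the remembered rows, freeing them as we go.
        return self._drain()

    def _fill(self):
        for row in self._compute_upstream():
            self._rows.append(row)
            yield row

    def _drain(self):
        # Stops as soon as the buffer is empty, so the first pass must
        # stay ahead of this one.
        while self._rows:
            yield self._rows.popleft()
```

With this shape, the first pass could feed the rows to the Python worker while the second pass pairs the original rows with the UDF results, so the upstream is evaluated only once and no row is held longer than needed.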