[GitHub] spark pull request: [SPARK-8632] [SQL] [PYSPARK] Poor Python UDF p...

davies Wed, 16 Sep 2015 16:32:40 -0700

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-140925636
  
    @justinuang This patch works pretty well with multiple UDFs, but I have two 
concerns before reviewing the details: 1) it adds some overhead to each batch, 
which causes a regression for the single-UDF case; 2) it makes lots of changes 
to PythonRDD and worker.py, so Python UDFs would work differently from other 
PythonRDDs, which increases the complexity (and the potential for bugs).
    
    There is another approach, as we discussed before: use a better cache for 
the upstream RDD. It could be called CachedOnceRDD; it appends all the rows to 
an array when compute() is called for the first time, then pulls and removes 
the rows when compute() is called a second time. This CachedOnceRDD should work 
similarly to UnsafeExternalSorter (spilling to disk if there is not enough 
memory). I think this approach should perform better than the current approach 
without changing PythonRDD (which is already very complicated). I really want 
to try this out, but I have not had time to work on it.
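
    To make the cache-once idea concrete, here is a rough, non-authoritative 
sketch in plain Python. It is not Spark code: the class name CacheOnceBuffer, 
its methods, and the toy "UDF" are all made up for illustration, and it keeps 
rows in memory only (no spilling to disk, unlike what is proposed above).

```python
from collections import deque
from typing import Deque, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class CacheOnceBuffer(Generic[T]):
    """Hypothetical stand-in for the proposed CachedOnceRDD (in-memory only)."""

    def __init__(self, upstream: Iterable[T]) -> None:
        self._upstream = upstream
        self._rows: Deque[T] = deque()  # rows seen by the first pass
        self._calls = 0

    def compute(self) -> Iterator[T]:
        """First call buffers rows while yielding them; second call drains them."""
        self._calls += 1
        if self._calls == 1:
            return self._first_pass()
        if self._calls == 2:
            return self._second_pass()
        raise RuntimeError("rows are cached for exactly one replay")

    def _first_pass(self) -> Iterator[T]:
        for row in self._upstream:
            self._rows.append(row)  # remember the row for the second pass
            yield row

    def _second_pass(self) -> Iterator[T]:
        # Assumes the first pass has already produced the rows being read here.
        while self._rows:
            yield self._rows.popleft()  # pull and remove, freeing memory


# Usage: the first pass feeds a toy "UDF", the second pass pairs rows with results.
buf = CacheOnceBuffer(iter([1, 2, 3]))
results = [x * 10 for x in buf.compute()]
joined = list(zip(buf.compute(), results))
```

    The upstream iterator is consumed only once, and each row is dropped as 
soon as the second pass reads it, so the buffer never holds more than the rows 
that are still waiting for their results.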



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
