[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray

Davies Liu (JIRA) Mon, 06 Apr 2015 15:52:52 -0700

Davies Liu created SPARK-6728:
---------------------------------

             Summary: Improve performance of py4j for large bytearray
                 Key: SPARK-6728
                 URL: https://issues.apache.org/jira/browse/SPARK-6728
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Davies Liu



PySpark relies on py4j to transfer function arguments and return values between 
Python and the JVM. Passing a large bytearray (larger than 10M) this way is very slow.

In MLlib, it's possible to have a Vector of more than 100M bytes. Transferring it 
will need a few GB of memory and may crash.

The reason is that py4j uses a text protocol: it encodes the bytearray as 
base64 and does multiple string concatenations.
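
The base64 overhead alone can be sketched in plain Python. This is only an illustration, not py4j's actual code: the helper name base64_encoded_size and the 10 MiB payload are assumptions chosen to match the size mentioned above. It shows that the encoded text is about one third larger than the raw bytes, before any extra copies from string concatenation are counted.

```python
import base64


def base64_encoded_size(n_bytes):
    """Length of the padded base64 text for n_bytes of binary input.

    Hypothetical helper for illustration: base64 maps every 3 input
    bytes (rounded up) to 4 output characters.
    """
    return 4 * ((n_bytes + 2) // 3)


# A 10 MiB payload of zero bytes, standing in for a large bytearray argument.
payload = bytearray(10 * 1024 * 1024)

# Encoding makes a full copy that is roughly 4/3 the size of the input.
encoded = base64.b64encode(payload)

# Decoding on the receiving side makes yet another full copy.
decoded = base64.b64decode(encoded)

overhead = len(encoded) / float(len(payload))
print("raw bytes:     %d" % len(payload))
print("base64 chars:  %d" % len(encoded))
print("size ratio:    %.3f" % overhead)
```

Each further concatenation of the encoded text into a larger message string would copy it again, which is consistent with memory use growing to several times the payload size.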

A binary protocol would help a lot. I created an issue for py4j: 
https://github.com/bartdag/py4j/issues/159



