Davies Liu created SPARK-6728:
---------------------------------

    Summary: Improve performance of py4j for large bytearray
    Key: SPARK-6728
    URL: https://issues.apache.org/jira/browse/SPARK-6728
    Project: Spark
    Issue Type: Improvement
    Components: PySpark
    Reporter: Davies Liu

PySpark relies on py4j to transfer function arguments and return values between Python and the JVM. Passing a large bytearray (larger than 10M) this way is very slow. In MLlib it is possible to have a Vector of more than 100M bytes, which will need a few GB of memory and may crash.

The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and does multiple string concatenations. A binary protocol would help a lot. I created an issue for py4j: https://github.com/bartdag/py4j/issues/159
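
To make the overhead concrete, here is a minimal sketch comparing a base64 text frame with a length-prefixed binary frame for the same payload. The function names and both frame layouts are hypothetical and chosen only for illustration; this is not py4j's actual wire format or API. It only shows that base64 inflates the payload by roughly one third before any extra string copies are counted.

```python
import base64
import struct


def encode_text(payload: bytes) -> bytes:
    """Hypothetical text-style frame: base64 body terminated by a newline."""
    return base64.b64encode(payload) + b"\n"


def decode_text(frame: bytes) -> bytes:
    """Inverse of encode_text: drop the trailing newline, then decode base64."""
    return base64.b64decode(frame.rstrip(b"\n"))


def encode_binary(payload: bytes) -> bytes:
    """Hypothetical binary frame: 4-byte big-endian length, then the raw bytes."""
    return struct.pack(">I", len(payload)) + payload


def decode_binary(frame: bytes) -> bytes:
    """Inverse of encode_binary: read the length prefix, then slice the body."""
    (length,) = struct.unpack(">I", frame[:4])
    return frame[4:4 + length]


# 1 MiB sample payload covering every byte value.
payload = bytes(range(256)) * 4096

text_frame = encode_text(payload)
binary_frame = encode_binary(payload)

print("payload bytes:     ", len(payload))
print("text frame bytes:  ", len(text_frame))
print("binary frame bytes:", len(binary_frame))
```

The text frame is about 4/3 the size of the payload, and a receiver must hold both the encoded and the decoded copy at once. Building the encoded string through repeated concatenation adds further temporary copies on top of that, which is consistent with a 100M-byte Vector needing a few GB.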