Xiao Ming Bao created SPARK-17602:
-------------------------------------

             Summary: PySpark - Performance Optimization for Large Broadcast Variables
                 Key: SPARK-17602
                 URL: https://issues.apache.org/jira/browse/SPARK-17602
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.5.1
         Environment: Linux
            Reporter: Xiao Ming Bao
             Fix For: 2.0.0
Problem: currently, on the executor side, the broadcast variable is written to disk as a file, and each Python worker process reads it from local disk and deserializes it into a Python object before executing a task. When the broadcast variable is large, this read/deserialization takes a long time. And when the Python worker is NOT reused and the number of tasks is large, performance becomes very bad, since each task pays the read/deserialization cost again.

Brief of the solution: transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and deserialize it into an object in the daemon Python process. After a worker Python process is forked from the daemon Python process, the worker automatically has the deserialized object and can use it directly, because of Linux's memory copy-on-write mechanism.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
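To illustrate the proposed fork/copy-on-write idea, here is a minimal, hypothetical Python sketch (not Spark's actual daemon code; the function names `daemon_main` and `run_task` are invented for illustration). The "daemon" deserializes the broadcast value exactly once, then forks workers that inherit the in-memory object through the kernel's copy-on-write pages, so no worker re-reads or re-deserializes it:

```python
import os
import pickle


def run_task(task_id, broadcast_value):
    # Placeholder task: just verify the broadcast data is visible in the
    # worker without any disk read or deserialization.
    lookup = broadcast_value["lookup"]
    return lookup[task_id % len(lookup)] >= 0


def daemon_main(serialized_broadcast, num_tasks=2):
    # Deserialize ONCE, in the daemon, before any worker is forked.
    broadcast_value = pickle.loads(serialized_broadcast)

    pids = []
    for task_id in range(num_tasks):
        pid = os.fork()  # Linux: child shares parent pages copy-on-write
        if pid == 0:
            # Worker process: broadcast_value is already a live object.
            ok = run_task(task_id, broadcast_value)
            os._exit(0 if ok else 1)
        pids.append(pid)

    # Daemon reaps workers; a status of 0 means the task succeeded.
    return [os.waitpid(p, 0)[1] for p in pids]


if __name__ == "__main__":
    data = pickle.dumps({"lookup": list(range(1000))})
    print(daemon_main(data))
```

Because the forked workers only read the object, the shared pages are never copied, which is what makes this cheaper than the current per-task read/deserialize path (workers that mutate the object would trigger page copies and lose the benefit).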