Adam Davison created SPARK-4315:
-----------------------------------

             Summary: PySpark pickling of pyspark.sql.Row objects is extremely inefficient
                 Key: SPARK-4315
                 URL: https://issues.apache.org/jira/browse/SPARK-4315
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.1.0
         Environment: Ubuntu, Python 2.7, Spark 1.1.0
            Reporter: Adam Davison


I am working with an RDD of pyspark.sql.Row objects, created by reading a file with 
SQLContext in a local PySpark context.

Operations on this RDD, such as data.groupBy(lambda x: x.field_name), are 
extremely slow (more than 10x slower than an equivalent Scala/Spark 
implementation). I expected PySpark to be somewhat slower, but the gap was 
large enough that I did some digging.
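
For reference, a minimal sketch of the kind of job involved is below (the input 
file, app name and field name are placeholders I have invented for illustration, 
not details from this report):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "row-unpickle-repro")
sqlCtx = SQLContext(sc)

# An RDD of pyspark.sql.Row objects, here read from a JSON file
# ("people.json" and field_name are made-up placeholders).
data = sqlCtx.jsonFile("people.json")

# The groupBy forces every Row to be pickled to and unpickled from the
# Python workers, which is where the slowdown shows up.
counts = data.groupBy(lambda x: x.field_name) \
             .map(lambda kv: (kv[0], len(list(kv[1])))) \
             .collect()
print(counts)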

Luckily, it's fairly easy to add profiling to the Python workers. I see that the 
vast majority of the time is spent in:

spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object)
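
For anyone wanting to reproduce the profile, one simple approach (a sketch, 
assuming worker.py's existing main(infile, outfile) entry point; not necessarily 
the only way to instrument the workers) is to wrap the worker in cProfile so each 
worker process dumps a stats file:

import cProfile
import os
import tempfile

def profiled_main(infile, outfile):
    profiler = cProfile.Profile()
    try:
        # main() is worker.py's existing entry point
        profiler.runcall(main, infile, outfile)
    finally:
        # one stats file per worker process, readable with the pstats module
        path = os.path.join(tempfile.gettempdir(),
                            "pyspark-worker-%d.pstats" % os.getpid())
        profiler.dump_stats(path)

Calling profiled_main() where worker.py currently calls main() produces per-worker 
.pstats files that can be sorted by cumulative time with pstats.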

It seems that _restore_object tries to speed up the unpickling of Rows with a 
cache. Some debugging reveals that this cache grows quite large (hundreds of 
entries). Disabling the cache by adding:

return _create_cls(dataType)(obj)

as the first line of _restore_object made my query run 5x faster, implying that 
the cache is not providing the intended speed-up.
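
Concretely, the workaround looks roughly like this (a sketch; the remainder of 
the function is the existing caching logic, elided here):

def _restore_object(dataType, obj):
    """ Restore object during unpickling. """
    # Workaround: always rebuild the Row class and skip the class cache.
    # The existing cache lookup below becomes unreachable.
    return _create_cls(dataType)(obj)
    # ... original cache lookup / class creation / cache insert ...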


