[jira] [Created] (SPARK-3095) [PySpark] Speed up RDD.count()

Davies Liu (JIRA) Sun, 17 Aug 2014 22:45:07 -0700

Davies Liu created SPARK-3095:
---------------------------------

             Summary: [PySpark] Speed up RDD.count()
                 Key: SPARK-3095
                 URL: https://issues.apache.org/jira/browse/SPARK-3095
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Davies Liu
            Priority: Minor



RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a
PipelinedRDD.

If the JavaRDD is serialized in batch mode, the deserialization of every chunk
except the last one can be skipped, because all of those chunks hold the same
number of elements. In some special cases the chunks are re-ordered, and then
this shortcut will not work.
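
A minimal sketch of the batch-counting idea, not the actual PySpark change. It
assumes the batch size is known up front and that only the final chunk may be
short. The helper names make_batches and count_batched are hypothetical, and
plain pickle stands in for whatever serializer the RDD really uses.

```python
import pickle


def make_batches(items, batch_size):
    """Pickle items in fixed-size batches; only the last batch may be short.

    Hypothetical helper that stands in for a batch-mode serializer.
    """
    return [
        pickle.dumps(list(items[i:i + batch_size]))
        for i in range(0, len(items), batch_size)
    ]


def count_batched(chunks, batch_size):
    """Count elements while deserializing only the last chunk.

    Assumes the chunks are still in their original order, so every chunk
    except the last holds exactly batch_size elements. If the chunks were
    re-ordered, a short chunk could sit in the middle and the result
    would be wrong.
    """
    if not chunks:
        return 0
    last_len = len(pickle.loads(chunks[-1]))
    return (len(chunks) - 1) * batch_size + last_len
```

With 10 elements and a batch size of 4, the chunks hold 4, 4 and 2 elements, so
only the third chunk is unpickled and the count is 2 * 4 + 2.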



--
This message was sent by Atlassian JIRA
(v6.2#6252)

