Davies Liu created SPARK-3095:
---------------------------------

             Summary: [PySpark] Speed up RDD.count()
                 Key: SPARK-3095
                 URL: https://issues.apache.org/jira/browse/SPARK-3095
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Davies Liu
            Priority: Minor


RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a PipelinedRDD.

If the JavaRDD is serialized in batch mode, deserialization can be skipped for
every chunk except the last one, because all the other chunks contain the same
number of elements. In some special cases the chunks are re-ordered, and this
shortcut will not work.
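
A minimal sketch of the counting idea, not the actual PySpark implementation.
It assumes a hypothetical layout in which each chunk is a pickled list preceded
by a 4-byte big-endian length, and that only the final chunk may hold fewer
than batch_size elements. The helper names write_batches and count_batched are
made up for illustration.

```python
import io
import pickle
import struct


def write_batches(items, batch_size):
    """Serialize items as length-prefixed pickled lists (assumed layout)."""
    buf = io.BytesIO()
    for start in range(0, len(items), batch_size):
        payload = pickle.dumps(items[start:start + batch_size])
        buf.write(struct.pack("!i", len(payload)))
        buf.write(payload)
    buf.seek(0)
    return buf


def count_batched(stream, batch_size):
    """Count elements while unpickling only the final chunk."""
    full_chunks = 0
    last_payload = None
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of stream
        (length,) = struct.unpack("!i", header)
        if last_payload is not None:
            # The previously read chunk was not the last one, so it is
            # assumed to be full and is never deserialized.
            full_chunks += 1
        last_payload = stream.read(length)
    if last_payload is None:
        return 0  # empty stream
    # Only the last chunk is deserialized to learn its real size.
    return full_chunks * batch_size + len(pickle.loads(last_payload))
```

If the chunks have been re-ordered so that a short chunk is no longer last,
this shortcut over-counts, and every chunk would have to be deserialized
instead.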



