Davies Liu created SPARK-3095:
---------------------------------
Summary: [PySpark] Speed up RDD.count()
Key: SPARK-3095
URL: https://issues.apache.org/jira/browse/SPARK-3095
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Davies Liu
Priority: Minor
RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a
PipelinedRDD. If the JavaRDD is serialized in batch mode, it is possible to
skip the deserialization of every chunk except the last one, because all the
other chunks hold the same number of elements. In some special cases the
chunks are re-ordered, and then this shortcut will not work.
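
A minimal sketch of the batch-mode idea, outside of Spark: it is not the
PySpark implementation. The helpers make_chunks and count_batched are
hypothetical names, and pickle stands in for whatever batched serializer the
RDD actually uses. The sketch assumes every chunk except the last holds
exactly batch_size elements and that the chunks are still in their original
order.

```python
import pickle


def make_chunks(items, batch_size):
    """Serialize items into pickled batches of at most batch_size elements.

    Hypothetical stand-in for a batched serializer; not a PySpark API.
    """
    return [pickle.dumps(items[i:i + batch_size])
            for i in range(0, len(items), batch_size)]


def count_batched(chunks, batch_size):
    """Count elements while deserializing only the final chunk.

    Assumes all chunks but the last are full (batch_size elements each)
    and that the chunks have not been re-ordered.
    """
    if not chunks:
        return 0
    full_chunks = len(chunks) - 1
    last_size = len(pickle.loads(chunks[-1]))  # the only deserialization
    return full_chunks * batch_size + last_size


chunks = make_chunks(list(range(2500)), 1024)
total = count_batched(chunks, 1024)
```

If the chunks were re-ordered so that the short chunk is no longer last, the
same arithmetic would over-count, which is the special case mentioned above.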