Francesco Cavrini created SPARK-30580:
-----------------------------------------
Summary: Why can PySpark persist data only in serialised format?
Key: SPARK-30580
URL: https://issues.apache.org/jira/browse/SPARK-30580
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Francesco Cavrini
The storage levels in PySpark only allow persisting data in serialised format.
There is also [a comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28]
explicitly stating that "Since the data is always serialized on the Python
side, all the constants use the serialized formats." While that makes total
sense for RDDs, it is not clear to me why it is not possible to persist data
without serialisation when using the DataFrame/Dataset APIs. In theory, in such
cases, persist would only be a directive and the data would never leave the
JVM, thus allowing for deserialised persistence, correct? Many thanks for the
feedback!
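
For illustration, a minimal sketch of what is being asked about. PySpark does not ship a deserialised constant, so the manually constructed StorageLevel below (and the SparkSession setup) are assumptions for the example, not an endorsed API usage:

{code:python}
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()
df = spark.range(1000000)

# All built-in PySpark constants are serialised, e.g.
# StorageLevel.MEMORY_ONLY == StorageLevel(False, True, False, False),
# where the last positional flag is deserialized=False.

# Hypothetical deserialised level for the DataFrame API, where persist is
# only a directive forwarded to the JVM and the data never crosses to Python.
# Constructor signature: StorageLevel(useDisk, useMemory, useOffHeap, deserialized)
memory_only_deser = StorageLevel(False, True, False, True)

df.persist(memory_only_deser)
df.count()  # trigger an action so the cache is materialised on the JVM side
{code}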