Francesco Cavrini created SPARK-30580:
-----------------------------------------
Summary: Why can PySpark persist data only in serialised format?
Key: SPARK-30580
URL: https://issues.apache.org/jira/browse/SPARK-30580
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Francesco Cavrini
The storage levels in PySpark only allow persisting data in serialised format.
There is also [a comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28]
explicitly stating that "Since the data is always serialized on the Python
side, all the constants use the serialized formats." While that makes total
sense for RDDs, it is not clear to me why it is not possible to persist data
without serialisation when using the DataFrame/Dataset APIs. In theory, in such
cases, persist would only be a directive and the data would never leave the
JVM, thus allowing for deserialised persistence, correct? Many thanks for the
feedback!
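
For illustration, a minimal sketch of what is being asked about. PySpark does not ship a deserialised constant, so the manually constructed StorageLevel below (and the SparkSession setup) are assumptions for the example, not an endorsed API usage:

{code:python}
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()
df = spark.range(1000000)

# All built-in PySpark constants are serialised, e.g.
# StorageLevel.MEMORY_ONLY == StorageLevel(False, True, False, False),
# where the last positional flag is deserialized=False.

# Hypothetical deserialised level for the DataFrame API, where persist is
# only a directive forwarded to the JVM and the data never crosses to Python.
# Constructor signature: StorageLevel(useDisk, useMemory, useOffHeap, deserialized)
memory_only_deser = StorageLevel(False, True, False, True)

df.persist(memory_only_deser)
df.count()  # trigger an action so the cache is materialised on the JVM side
{code}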