[https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100374#comment-17100374]
Tianshi Zhu commented on SPARK-31448:
-------------------------------------
I found the following comment in storagelevel.py in Spark 2.4.3:
_".. note:: The following four storage level constants are deprecated in 2.0, since the records will always be serialized in Python."_
[https://github.com/apache/spark/blob/v2.4.3/python/pyspark/storagelevel.py#L61]
So I would assume the Scala counterpart of PySpark's MEMORY_AND_DISK is
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L162]
In Scala, val MEMORY_AND_DISK = new StorageLevel(true, true, false, true) sets
the fourth argument (deserialized) to true, which means the data is stored
deserialized. Does that help?
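To make the comparison concrete, here is a minimal sketch that lines up the three constants discussed above. `Level` is a hypothetical stand-in that only mirrors the positional argument order (useDisk, useMemory, useOffHeap, deserialized, replication); it is not the real `StorageLevel` class, and the flag values are copied from this thread.

```python
from typing import NamedTuple


class Level(NamedTuple):
    """Hypothetical stand-in that mirrors the StorageLevel argument order."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


# Flag values as quoted in this thread.
SCALA_MEMORY_AND_DISK = Level(True, True, False, True)
SCALA_MEMORY_AND_DISK_SER = Level(True, True, False, False)
PYSPARK_MEMORY_AND_DISK = Level(True, True, False, False)


def differing_fields(a: Level, b: Level) -> list:
    """Return the names of the fields on which two levels disagree."""
    return [name for name in a._fields if getattr(a, name) != getattr(b, name)]


# The two MEMORY_AND_DISK constants disagree only on the deserialized flag.
print(differing_fields(SCALA_MEMORY_AND_DISK, PYSPARK_MEMORY_AND_DISK))
# PySpark's MEMORY_AND_DISK carries the same flags as Scala's MEMORY_AND_DISK_SER.
print(PYSPARK_MEMORY_AND_DISK == SCALA_MEMORY_AND_DISK_SER)
```

Under these assumptions, the only disagreement is the deserialized flag, which is consistent with the note that records are always serialized on the Python side.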
> Difference in Storage Levels used in cache() and persist() for pyspark
> dataframes
> ---------------------------------------------------------------------------------
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Abhishek Dixit
> Priority: Major
>
> The default storage level *MEMORY_AND_DISK* is defined differently in pyspark
> and scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>
> *Problem Description:*
> Calling *df.cache()* on a pyspark dataframe directly invokes the Scala method
> cache(), and the storage level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* on a pyspark dataframe sets
> newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and
> then invokes the Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke pyspark function persist inside pyspark function cache instead of
> calling the scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and
> the possible fix is the correct approach.
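
The behaviour described in the report, and the proposed fix, can be sketched with a small model. Everything below is hypothetical: `ModelDataFrame`, `PatchedDataFrame`, and the `_jvm_*` helpers are illustrative names and not the real pyspark.sql.DataFrame API. The two default levels are the values quoted in the issue description.

```python
# Hypothetical model of the reported behaviour; not the real PySpark classes.
JVM_DEFAULT = (True, True, False, True)      # level used by Scala cache(), per the report
PYTHON_DEFAULT = (True, True, False, False)  # PySpark 2.4.3 MEMORY_AND_DISK, per the report


class ModelDataFrame:
    def __init__(self):
        self.storage_level = None

    def _jvm_cache(self):
        # Stand-in for the Scala cache(), which applies its own default level.
        self.storage_level = JVM_DEFAULT

    def _jvm_persist(self, level):
        # Stand-in for the Scala persist(newStorageLevel).
        self.storage_level = level

    def cache(self):
        # Reported behaviour: delegate straight to the JVM cache().
        self._jvm_cache()
        return self

    def persist(self, level=PYTHON_DEFAULT):
        # Reported behaviour: pick the Python-side default, then call the JVM persist().
        self._jvm_persist(level)
        return self


class PatchedDataFrame(ModelDataFrame):
    def cache(self):
        # Proposed fix: route cache() through the Python persist().
        return self.persist()


print(ModelDataFrame().cache().storage_level)    # JVM default
print(ModelDataFrame().persist().storage_level)  # Python default
print(PatchedDataFrame().cache().storage_level)  # matches persist() after the fix
```

In this model the unpatched cache() and persist() end up with different levels, while the patched cache() agrees with persist(). Whether that is the right fix for Spark itself is the question the report leaves open.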