[ 
https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dixit reopened SPARK-31448:
------------------------------------

Let me try to explain the problem more. 

Please look at this code in pyspark/dataframe.py: 
{code:python}
@since(1.3)
def cache(self):
    """Persists the :class:`DataFrame` with the default storage level
    (C{MEMORY_AND_DISK}).

    .. note:: The default storage level has changed to C{MEMORY_AND_DISK}
        to match Scala in 2.0.
    """
    self.is_cached = True
    self._jdf.cache()
    return self
{code}
The cache method on a PySpark dataframe directly calls Scala's cache method, so the storage level used is the Scala default, StorageLevel(true, true, false, true), with deserialized equal to true. But since data coming from Python is already serialized by the pickle library, we should be using a storage level with deserialized = false for PySpark dataframes.

By contrast, the cache method in pyspark/rdd.py sets the storage level on the Python side and then calls the Scala method with that level as a parameter. Hence the correct storage level, with deserialized = false, is used in that case.
{code:python}
def cache(self):
    """
    Persist this RDD with the default storage level (C{MEMORY_ONLY}).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self
{code}
We need to implement the same approach in the cache method in dataframe.py to avoid using the Scala default of deserialized = true.
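To illustrate the proposed pattern without a Spark installation, here is a minimal, self-contained sketch. FakeDataFrame and the StorageLevel namedtuple are stand-ins invented for this example (the real pyspark.StorageLevel takes the same constructor order: useDisk, useMemory, useOffHeap, deserialized); the point is that cache() delegates to persist() with the Python-side default instead of calling the JVM cache() directly:

```python
from collections import namedtuple

# Stand-in for pyspark.StorageLevel; field order mirrors the real
# constructor: (useDisk, useMemory, useOffHeap, deserialized).
StorageLevel = namedtuple(
    "StorageLevel", ["useDisk", "useMemory", "useOffHeap", "deserialized"]
)

SCALA_MEMORY_AND_DISK = StorageLevel(True, True, False, True)     # deserialized=true
PYSPARK_MEMORY_AND_DISK = StorageLevel(True, True, False, False)  # deserialized=false


class FakeDataFrame:
    """Toy model of the proposed fix: cache() routes through persist()."""

    def __init__(self):
        self.is_cached = False
        self.storage_level = None

    def persist(self, level=PYSPARK_MEMORY_AND_DISK):
        # Record the level on the Python side before handing off,
        # as rdd.py does with self.persist(StorageLevel.MEMORY_ONLY).
        self.is_cached = True
        self.storage_level = level
        return self

    def cache(self):
        # Proposed: delegate to persist() so the PySpark default
        # (deserialized=False) is used, not the Scala default.
        return self.persist(PYSPARK_MEMORY_AND_DISK)


df = FakeDataFrame().cache()
print(df.storage_level.deserialized)  # False
```

This mirrors the rdd.py pattern quoted above: the storage level is chosen in Python and passed down, rather than letting the JVM pick its own default.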

> Difference in Storage Levels used in cache() and persist() for pyspark 
> dataframes
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-31448
>                 URL: https://issues.apache.org/jira/browse/SPARK-31448
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: Abhishek Dixit
>            Priority: Major
>
> There is a difference in default storage level *MEMORY_AND_DISK* in pyspark 
> and scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>  
> *Problem Description:* 
> Calling *df.cache()*  for pyspark dataframe directly invokes Scala method 
> cache() and Storage Level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* for pyspark dataframe sets the 
> newStorageLevel=StorageLevel(true, true, false, false) inside pyspark and 
> then invokes Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke pyspark function persist inside pyspark function cache instead of 
> calling the scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and 
> the possible fix is the correct approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
