[ 
https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Altae-Tran updated SPARK-26906:
-----------------------------------
       Priority: Minor  (was: Major)
    Description: 
PySpark RDD replication does not seem to work properly. Even in a simple 
example, the UI reports only 1x replication although the RDD is persisted with 
the 2x-replicated storage level DISK_ONLY_2:
{code:java}
rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52

mapped.count(){code}
 

Interestingly, if you catch the UI page at just the right time, it starts off 
showing 2x replication but ends up showing 1x afterward. Perhaps the RDD is in 
fact replicated and only the UI fails to register it.
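
One way to cross-check what the UI shows is to read the same storage data from Spark's monitoring REST API and count how many executors hold each cached block. The following is only a minimal sketch: it assumes the JSON returned for a single RDD under the storage/rdd endpoint carries a "storageLevel" description string and a "partitions" list whose entries have "blockName" and "executors" fields, which should be verified against the Spark version in use. The helper names and the sample payload are hypothetical and are not taken from this report.

```python
import re


def replication_from_description(description):
    """Extract N from a storage-level string such as 'Disk Serialized 2x Replicated'."""
    match = re.search(r"(\d+)x Replicated", description)
    return int(match.group(1)) if match else None


def under_replicated_partitions(rdd_info, expected):
    """Return block names of cached partitions held by fewer than `expected` executors."""
    return [
        p["blockName"]
        for p in rdd_info.get("partitions", [])
        if len(p.get("executors", [])) < expected
    ]


# Hypothetical payload, shaped like the assumed response for one RDD from
# /api/v1/applications/<app-id>/storage/rdd/<rdd-id>. All values are made up.
sample = {
    "id": 1,
    "storageLevel": "Disk Serialized 1x Replicated",
    "partitions": [
        {"blockName": "rdd_1_0",
         "storageLevel": "Disk Serialized 1x Replicated",
         "executors": ["3"]},
        {"blockName": "rdd_1_1",
         "storageLevel": "Disk Serialized 2x Replicated",
         "executors": ["3", "7"]},
    ],
}

print(replication_from_description(sample["storageLevel"]))
print(under_replicated_partitions(sample, expected=2))
```

If a check like this lists blocks held by a single executor after mapped.count() has finished, the discrepancy would not be limited to how the UI renders the data.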

  was:
Pyspark RDD replication doesn't seem to be functioning properly. Even with a 
simple example, the UI reports only 1x replication, despite using the flag for 
2x replication
{code:java}
rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52

mapped.count(){code}
 

resulting in the following:

!image-2019-02-17-01-33-08-551.png!

 

Interestingly, if you catch the UI page at just the right time, you see that it 
starts off 2x replicated:

 

!image-2019-02-17-01-35-37-034.png!

 

but ends up going back to 1x replicated once the RDD is fully materialized. 
This is likely not a UI bug because the cached partitions page also shows only 
1x replication:

 

!image-2019-02-17-01-36-55-418.png!

 

This could result from some type of optimization for replication, but is 
undesirable for users that want a specific level of replication for fault 
tolerance. 

        Summary: Pyspark RDD Replication Potentially Not Working  (was: Pyspark 
RDD Replication Not Working)

> Pyspark RDD Replication Potentially Not Working
> -----------------------------------------------
>
>                 Key: SPARK-26906
>                 URL: https://issues.apache.org/jira/browse/SPARK-26906
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Web UI
>    Affects Versions: 2.3.2
>         Environment: Google Cloud Dataproc image version [1.3.19-deb9 
> 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
>  (Spark 2.3.2 and Hadoop 2.9.0) on Debian 9, with Python 3.7. The PySpark 
> shell is started with pyspark --num-executors = 100
>            Reporter: Han Altae-Tran
>            Priority: Minor
>
> PySpark RDD replication does not seem to work properly. Even in a simple 
> example, the UI reports only 1x replication although the RDD is persisted 
> with the 2x-replicated storage level DISK_ONLY_2:
> {code:java}
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
> mapped.count(){code}
>  
> Interestingly, if you catch the UI page at just the right time, it starts 
> off showing 2x replication but ends up showing 1x afterward. Perhaps the RDD 
> is in fact replicated and only the UI fails to register it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
