https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Apache Spark reassigned SPARK-33380:
------------------------------------

    Assignee:     (was: Apache Spark)

> Incorrect output from example script pi.py
> ------------------------------------------
>
>                 Key: SPARK-33380
>                 URL: https://issues.apache.org/jira/browse/SPARK-33380
>             Project: Spark
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 2.4.6
>            Reporter: Milind V Damle
>            Priority: Minor
>
>  
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2 
> worker nodes. To test the installation, I ran the 
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6. 
> Three runs produced the following output:
>  
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>  
> I noted that the computed value of Pi varies with each run.
> Next, I ran the same script three more times with a higher number of
> partitions (16) and noted the following output:
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master spark://10.0.0.20:7077 /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>  
> Again, I noted that the computed value of Pi varies with each run. 
>  
> IMO, there are 2 issues with this example script:
> 1. The output (the value of pi) is non-deterministic, because the script
> uses random.random().
> 2. Specifying the number of partitions (accepted as a command-line argument)
> has no observable positive impact on the accuracy or precision of the result.
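For context, pi.py computes a Monte Carlo estimate: it samples random points and counts how many land inside the unit circle. Stripped of the Spark parallelization, its core logic is roughly the following plain-Python sketch (variable names are mine, not the script's):

```python
import random

def inside(_):
    # Sample a point uniformly in the square [-1, 1] x [-1, 1];
    # it falls inside the unit circle with probability pi/4.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

n = 100_000  # total number of samples; pi.py uses 100000 * partitions
count = sum(inside(i) for i in range(n))
# Each run draws different samples, so the printed estimate varies run to run.
print("Pi is roughly %f" % (4.0 * count / n))
```

Because the estimate is a random variable, its error shrinks only statistically (roughly as 1/sqrt(n)), which is why three runs print three different values.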
>  
> It may be argued that the intent of these example scripts is simply to
> demonstrate how to use Spark, as well as to offer a means to quickly verify
> an installation. However, we can achieve that objective without compromising
> the accuracy or determinism of the computed value. Unless the user examines
> the script and understands that the use of random.random() (to generate
> random points within the top right quadrant of the circle) is the reason
> behind the non-determinism, it seems confusing at first that the value
> varies per run and is also inaccurate. Someone may (incorrectly) infer that
> this is a limitation of the framework!
>  
> To mitigate this, I wrote an alternate version that computes pi using a
> partial sum of terms from an infinite series. This script is deterministic
> and produces more accurate output if the user configures it to use more
> terms. To me, that behavior feels intuitive and logical. I will be happy to
> share it if that is appropriate.
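The reporter's script is not attached, but a minimal sketch of the idea, using the Leibniz series purely as one illustrative choice (any convergent series for pi would serve), looks like this:

```python
def leibniz_pi(terms):
    # Partial sum of the Leibniz series: pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...)
    # The result is deterministic: the same term count always yields the same
    # value, and accuracy improves monotonically with more terms.
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(terms))

# More terms => a tighter approximation; repeated runs print identical output.
print("Pi is roughly %f" % leibniz_pi(1_000_000))
```

With a million terms the partial sum is within about 2e-6 of pi, and unlike the Monte Carlo version, the command-line argument (here, the term count) visibly controls the precision.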
>  
> Best regards,
> Milind
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
