[
https://issues.apache.org/jira/browse/SPARK-33380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-33380.
----------------------------------
Resolution: Won't Fix
> Incorrect output from example script pi.py
> ------------------------------------------
>
> Key: SPARK-33380
> URL: https://issues.apache.org/jira/browse/SPARK-33380
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Affects Versions: 2.4.6
> Reporter: Milind V Damle
> Priority: Minor
>
>
> I have Apache Spark v2.4.6 installed on my mini cluster of 1 driver and 2
> worker nodes. To test the installation, I ran the
> $SPARK_HOME/examples/src/main/python/pi.py script included with Spark-2.4.6.
> Three runs produced the following output:
>
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.149880
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.137760
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py
> Pi is roughly 3.155640
>
> I noted that the computed value of Pi varies with each run.
> Next, I ran the same script 3 more times with a higher number of partitions
> (16). The following output was noted.
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.141100
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.137720
> m4-nn:~:spark-submit --master
> spark://10.0.0.20:7077
> /usr/local/spark/examples/src/main/python/pi.py 16
> Pi is roughly 3.145660
>
> Again, I noted that the computed value of Pi varies with each run.
>
> IMO, there are 2 issues with this example script:
> 1. The output (value of pi) is non-deterministic because the script uses
> random.random().
> 2. Specifying the number of partitions (accepted as a command-line argument)
> has no observable positive impact on the accuracy or precision.
>
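For context on the two points above, here is a minimal plain-Python sketch (no Spark) of the kind of Monte Carlo estimate that pi.py performs. The function names, the seed parameter, and the sample counts are assumptions for illustration, not code taken from the example script. It suggests why the output changes between runs unless the generator is seeded, and why more samples improve the result only slowly: each sample is a hit with probability pi/4, so the standard error of the estimate is sqrt(pi * (4 - pi) / n), which shrinks like 1/sqrt(n).

```python
import math
import random


def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi from points drawn uniformly in [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    inside = 0
    for _ in range(num_samples):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x * x + y * y <= 1:  # point falls inside the unit circle
            inside += 1
    # The hit ratio approximates (circle area) / (square area) = pi / 4.
    return 4.0 * inside / num_samples


def standard_error(num_samples):
    """Theoretical standard error of the estimate: sqrt(pi * (4 - pi) / n)."""
    return math.sqrt(math.pi * (4 - math.pi) / num_samples)


if __name__ == "__main__":
    # Hypothetical sample counts, chosen only to show the scaling.
    for n in (20_000, 160_000):
        print("n=%d estimate=%f expected error ~%f"
              % (n, estimate_pi(n, seed=42), standard_error(n)))
```

Under these assumptions, an 8x increase in the number of samples reduces the expected error by a factor of only about 2.8, which may be hard to see across three runs.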
> It may be argued that the intent of these example scripts is simply to
> demonstrate how to use Spark as well as offer a means to quickly verify an
> installation. However, we can achieve that objective without compromising on
> the accuracy or determinism of the computed value. Unless the user examines
> the script and understands that the use of random.random() (to generate
> random points within the top right quadrant of the circle) is the reason
> behind the non-determinism, it seems confusing at first that the value varies
> per run and also that it is inaccurate. Someone may (incorrectly) infer that
> this is a limitation of the framework!
>
> To mitigate this, I wrote an alternate version to compute pi using a partial
> sum of terms from an infinite series. This script is both deterministic and
> can produce more accurate output if the user configures it to use more terms.
> To me, that behavior feels intuitive and logical. I will be happy to share it
> if it is appropriate.
>
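The reporter's script is not included in this message. As a rough illustration of the idea only, the sketch below sums the first terms of the Leibniz series, pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...). The choice of series, the function names, and the chunking scheme are all assumptions; the chunks merely stand in for the ranges a Spark job could hand to each partition. Because the series is alternating with decreasing terms, the error after N terms is below 4 / (2N + 1), so the result is repeatable and improves as more terms are used.

```python
import math


def leibniz_partial_sum(start, stop):
    """Sum of (-1)^k / (2k + 1) for k in [start, stop)."""
    return math.fsum(
        (-1.0 if k % 2 else 1.0) / (2 * k + 1) for k in range(start, stop)
    )


def pi_from_series(num_terms, num_chunks=4):
    """Approximate pi as 4 times the first num_terms terms of the Leibniz series.

    The term range is split into num_chunks contiguous chunks (a stand-in for
    partitions); each chunk is summed separately and the results are combined.
    """
    bounds = [num_terms * i // num_chunks for i in range(num_chunks + 1)]
    chunk_sums = [
        leibniz_partial_sum(lo, hi) for lo, hi in zip(bounds, bounds[1:])
    ]
    return 4.0 * math.fsum(chunk_sums)


if __name__ == "__main__":
    for terms in (1_000, 100_000):
        print("terms=%d pi~%.10f" % (terms, pi_from_series(terms)))
```

The Leibniz series converges slowly, so a practical version would likely pick a faster-converging series; it is used here only because its error bound is easy to state.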
> Best regards,
> Milind
>