[GitHub] spark pull request: SPARK-1438 RDD.sample() make seed param option...

mateiz Mon, 21 Apr 2014 23:12:19 -0700

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/477#issuecomment-41006940
  
    Hey, FYI, it's not a good idea to use System.nanoTime as the seed, because
multiple RDDs created at the same time (which can easily happen due to lazy
evaluation) would have the exact same seed. Use math.random() instead, or the
equivalent in PySpark. As far as I know, Math.random is synchronized, which is
bad for high-performance random number generation but good for getting distinct
numbers here.
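
    To illustrate the point on the Python side, here is a minimal sketch. It
is not Spark or PySpark code: sample_indices is a hypothetical stand-in for a
sampler, and time.time_ns() plays the role of System.nanoTime. It assumes that
two samplers end up reading the same clock value, which is the collision
described above, and contrasts that with drawing each seed from Python's shared
module-level generator.

```python
import random
import time


def sample_indices(seed, n=20, fraction=0.5):
    """Hypothetical sampler: keep index i when a seeded draw falls below fraction."""
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < fraction]


# Clock-based seeding: if two samplers observe the same clock reading,
# they get the same seed and therefore pick exactly the same elements.
clock_reading = time.time_ns()
same_clock_a = sample_indices(clock_reading)
same_clock_b = sample_indices(clock_reading)

# Seeding from a shared generator: each call advances the generator's
# state, so consecutive seeds differ (barring a 1-in-2**64 coincidence).
seed_a = random.getrandbits(64)
seed_b = random.getrandbits(64)
shared_a = sample_indices(seed_a)
shared_b = sample_indices(seed_b)

print("clock-seeded samples identical:", same_clock_a == same_clock_b)
print("shared-generator seeds distinct:", seed_a != seed_b)
```

    The design choice is the same as in the suggestion above: the seed source
should have state that changes on every call, so that creation time alone does
not determine the seed.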



