GitHub user mateiz commented on the pull request:
https://github.com/apache/spark/pull/477#issuecomment-41006940
Hey, FYI, it's not a good idea to use System.nanoTime as the seed, because
multiple RDDs created at the same time (which can easily happen due to lazy
evaluation) could end up with the exact same seed. Use math.random() instead,
or the equivalent in PySpark. The underlying Math.random is synchronized as far
as I know, which is bad for high-performance random number generation but good
for getting distinct numbers here.
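
To illustrate the concern, here is a minimal Python sketch. It is not code
from the pull request: the helper names (sample_with_seed, fresh_seed) and the
fixed timestamp value are hypothetical, chosen only to simulate two samplers
that read the clock in the same tick. It contrasts a timestamp-derived seed
with a seed drawn from a shared random source.

```python
import random


def sample_with_seed(seed, n=5):
    """Draw n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def fresh_seed():
    """Hypothetical helper: take a 64-bit seed from the module-level
    generator, which is shared by all callers in the process."""
    return random.getrandbits(64)


# Assumption: both samplers were created in the same clock tick, so a
# timestamp-based seed has the same value for both. The number below is
# only a stand-in for such a clock reading.
shared_timestamp = 1_398_000_000_000_000_000
time_seeded_a = sample_with_seed(shared_timestamp)
time_seeded_b = sample_with_seed(shared_timestamp)

# Alternative: each sampler asks the shared random source for its own seed.
seed_a, seed_b = fresh_seed(), fresh_seed()
random_seeded_a = sample_with_seed(seed_a)
random_seeded_b = sample_with_seed(seed_b)

# Identical seeds give identical streams.
print(time_seeded_a == time_seeded_b)  # True
# Seeds from the shared source are distinct with overwhelming probability,
# so the two streams are expected to differ.
print(random_seeded_a == random_seeded_b)
```

The trade-off is the one described above: every caller goes through one
shared generator, which is a poor fit for bulk random number generation but
is fine for handing out a single seed per sampler.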