Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21802#discussion_r203962951

    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2382,6 +2382,20 @@ def array_sort(col):
         return Column(sc._jvm.functions.array_sort(_to_java_column(col)))

    +@since(2.4)
    +def shuffle(col):
    +    """
    +    Collection function: Generates a random permutation of the given array.
    +
    +    .. note:: The function is non-deterministic because its results depends on order of rows which
    --- End diff --

The seed is fixed during the analysis phase, so if we, say, `collect()` twice or more from the same DataFrame, we will get the same result:

```scala
val df = .. .select(shuffle('arr))
df.collect() == df.collect()
```

but if we create another DataFrame from the same input, we will get different results:

```scala
val df1 = .. .select(shuffle('arr))
val df2 = .. .select(shuffle('arr))
df1.collect() != df2.collect()
```
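The behavior described above can be modeled in plain Python, independent of Spark: a seed chosen once "at analysis time" makes every evaluation of the same plan reproduce the same permutation, while a freshly analyzed plan gets a fresh seed. This is only an illustrative sketch; the helper `shuffled` and the way the seed is drawn are hypothetical stand-ins for Spark's internal `Shuffle` expression, not its actual implementation.

```python
import random

def shuffled(arr, seed):
    # Model of Spark's Shuffle expression: the seed is picked once
    # (when the plan is analyzed) and reused on every evaluation.
    rng = random.Random(seed)
    out = list(arr)
    rng.shuffle(out)
    return out

arr = list(range(10))

# The seed is fixed when the "plan" is built, not per evaluation.
seed = random.randrange(2**31)

# Same DataFrame, collect() twice: same seed, hence same permutation.
assert shuffled(arr, seed) == shuffled(arr, seed)

# Two DataFrames built from the same input get independent seeds,
# so their permutations will (almost certainly) differ.
seed1 = random.randrange(2**31)
seed2 = random.randrange(2**31)
# shuffled(arr, seed1) != shuffled(arr, seed2) in general
```

The design choice this mimics is that non-determinism lives at plan-analysis granularity rather than per-row or per-execution, which is what makes `df.collect() == df.collect()` hold for a single DataFrame.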