Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21802#discussion_r203962951

    --- Diff: python/pyspark/sql/functions.py ---
    @@ -2382,6 +2382,20 @@ def array_sort(col):
         return Column(sc._jvm.functions.array_sort(_to_java_column(col)))

    +@since(2.4)
    +def shuffle(col):
    +    """
    +    Collection function: Generates a random permutation of the given array.
    +
    +    .. note:: The function is non-deterministic because its results depends on order of rows which
    --- End diff --

The seed is fixed during the analysis phase, so if we, say, `collect()` twice or more from the same DataFrame, we will get the same result:

```scala
val df = .. .select(shuffle('arr))
df.collect() == df.collect()
```

but if we create another DataFrame from the same input, we will get different results:

```scala
val df1 = .. .select(shuffle('arr))
val df2 = .. .select(shuffle('arr))
df1.collect() != df2.collect()
```
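The behavior described above can be modeled in plain Python, independent of Spark: a seed chosen once "at analysis time" makes every evaluation of the same plan reproduce the same permutation, while a freshly analyzed plan gets a fresh seed. This is only an illustrative sketch; the helper `shuffled` and the way the seed is drawn are hypothetical stand-ins for Spark's internal `Shuffle` expression, not its actual implementation.

```python
import random

def shuffled(arr, seed):
    # Model of Spark's Shuffle expression: the seed is picked once
    # (when the plan is analyzed) and reused on every evaluation.
    rng = random.Random(seed)
    out = list(arr)
    rng.shuffle(out)
    return out

arr = list(range(10))

# The seed is fixed when the "plan" is built, not per evaluation.
seed = random.randrange(2**31)

# Same DataFrame, collect() twice: same seed, hence same permutation.
assert shuffled(arr, seed) == shuffled(arr, seed)

# Two DataFrames built from the same input get independent seeds,
# so their permutations will (almost certainly) differ.
seed1 = random.randrange(2**31)
seed2 = random.randrange(2**31)
# shuffled(arr, seed1) != shuffled(arr, seed2) in general
```

The design choice this mimics is that non-determinism lives at plan-analysis granularity rather than per-row or per-execution, which is what makes `df.collect() == df.collect()` hold for a single DataFrame.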