Arseniy Tashoyan created SPARK-23693:
----------------------------------------

             Summary: SQL function uuid()
                 Key: SPARK-23693
                 URL: https://issues.apache.org/jira/browse/SPARK-23693
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0, 2.2.1
            Reporter: Arseniy Tashoyan


Add a function uuid() to org.apache.spark.sql.functions that returns a
[universally unique identifier (UUID)|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
 * monotonically_increasing_id() function
 * row_number() function over some window
 * convert the DataFrame to RDD and zipWithIndex()

None of these approaches works when the DataFrame is appended to another
DataFrame (union). Collisions may occur: two rows in different DataFrames may
have the same ID. Re-generating IDs on the resulting DataFrame is not an
option, because data in some other system may already refer to the old IDs.
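
The collision can be illustrated with a minimal sketch. It assumes plain Scala
collections as a stand-in for DataFrames (Spark itself is not used here), and
the object and value names are hypothetical. Index-based IDs are assigned to
each data set independently, in the spirit of zipWithIndex(), and then the two
data sets are concatenated:

```scala
object IdCollisionSketch {
  // Plain Scala collections stand in for two DataFrames (an assumption for illustration).
  // Each data set gets its own index-based IDs, starting from 0.
  val first: Seq[(Long, String)] =
    Seq("a", "b", "c").zipWithIndex.map { case (value, i) => (i.toLong, value) }
  val second: Seq[(Long, String)] =
    Seq("x", "y").zipWithIndex.map { case (value, i) => (i.toLong, value) }

  // The "union" of the two data sets.
  val combined: Seq[(Long, String)] = first ++ second

  // IDs that are attached to more than one row after the union.
  val collidingIds: Set[Long] =
    combined.groupBy(_._1).collect { case (id, rows) if rows.size > 1 => id }.toSet
}
```

Both data sets start counting from 0, so the IDs shared by the two inputs end
up duplicated in the combined result.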

The proposed solution is to add a new function:

def uuid(): String

that returns the String representation of a UUID.
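
A minimal sketch of such a function in plain Scala follows. It is not the
implementation mentioned in this issue: it assumes java.util.UUID from the JDK
(random, version 4 UUIDs) instead of java-uuid-generator, it is an ordinary
method rather than a Spark SQL function, and the enclosing object name is
hypothetical.

```scala
import java.util.UUID

object UuidSketch {
  // Hypothetical stand-in for the proposed function.
  // Uses the JDK's random (version 4) UUID generator, which is an assumption;
  // the issue itself refers to an implementation based on java-uuid-generator.
  def uuid(): String = UUID.randomUUID().toString
}
```

Each call returns a 36-character string: 32 hexadecimal digits separated by
four hyphens.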

A UUID is a 128-bit value (two 64-bit long numbers). Neither Scala nor Java
has a primitive 128-bit numeric type. In addition, some storage systems do not
support 128-bit numbers (Parquet's largest numeric type is INT96). This is why
the uuid() function returns a String.
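
The relationship between the two long numbers and the String form can be
sketched with java.util.UUID from the JDK. The object name and the sample UUID
value are chosen only for illustration:

```scala
import java.util.UUID

object UuidBitsSketch {
  // A fixed sample value, used only so the example is deterministic.
  val id: UUID = UUID.fromString("123e4567-e89b-12d3-a456-426614174000")

  // The 128 bits split into two 64-bit halves.
  val high: Long = id.getMostSignificantBits
  val low: Long = id.getLeastSignificantBits

  // The same UUID rebuilt from the two halves, and its String form.
  val rebuilt: UUID = new UUID(high, low)
  val asString: String = rebuilt.toString
}
```

Storing the value as a String avoids carrying two separate long columns and
works with any storage system that supports text.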

I already have a simple implementation based on 
[java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
can share it as a PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
