[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-23693:
------------------------------------

    Assignee: (was: Apache Spark)

> SQL function uuid()
> -------------------
>
>                 Key: SPARK-23693
>                 URL: https://issues.apache.org/jira/browse/SPARK-23693
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.0
>            Reporter: Arseniy Tashoyan
>            Priority: Minor
>
> Add a function uuid() to org.apache.spark.sql.functions that returns a
> [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
> * the monotonically_increasing_id() function
> * the row_number() function over some window
> * converting the DataFrame to an RDD and calling zipWithIndex()
> None of these approaches works when appending the DataFrame to another
> DataFrame (union): collisions may occur, because two rows in different
> DataFrames may end up with the same ID. Re-generating IDs on the resulting
> DataFrame is not an option, because some data in some other system may
> already refer to the old IDs.
> The proposed solution is to add a new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns the String representation of a UUID.
> A UUID is a 128-bit number (two long numbers). Such numbers are not
> supported natively in Scala or Java. In addition, some storage systems do
> not support 128-bit numbers (Parquet's largest numeric type is INT96).
> This is the reason for the uuid() function to return String.
> I already have a simple implementation based on
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator].
> I can share it as a PR.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
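(Editor's sketch, not part of the original message.) Pending a built-in uuid() function, the idea described in the issue can be approximated at the user level with a Scala UDF based on the JDK's java.util.UUID instead of the java-uuid-generator library the reporter mentions; the name uuidCol below is hypothetical, and asNondeterministic() is assumed to be available (it was introduced on UserDefinedFunction around Spark 2.3):

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf

// Sketch only: a UDF returning the canonical 36-character String form of a
// random (version 4) UUID for each row, via the JDK's java.util.UUID.
// Marking the UDF non-deterministic tells the optimizer not to cache or
// collapse repeated invocations, so every row gets a fresh value.
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString).asNondeterministic()
val uuidCol: Column = uuidUdf()

// Hypothetical usage: df.withColumn("id", uuidCol)
{code}

Because each value is generated independently, IDs produced this way remain unique (with overwhelming probability) even across DataFrames that are later combined with union, which is exactly the property the issue asks for.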