Arseniy Tashoyan created SPARK-23693:
----------------------------------------

             Summary: SQL function uuid()
                 Key: SPARK-23693
                 URL: https://issues.apache.org/jira/browse/SPARK-23693
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0, 2.2.1
            Reporter: Arseniy Tashoyan


Add function uuid() to org.apache.spark.sql.functions that returns a [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].

Sometimes it is necessary to uniquely identify each row in a DataFrame.

Currently the following ways are available:
* monotonically_increasing_id() function
* row_number() function over some window
* convert the DataFrame to an RDD and call zipWithIndex()

None of these approaches works when this DataFrame is appended to another DataFrame (union). Collisions may occur: two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because data in some other system may already refer to the old IDs.

The proposed solution is to add a new function:

    def uuid(): String

that returns the String representation of a UUID.

A UUID is a 128-bit number (two long numbers). Scala and Java have no primitive numeric type of that width. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is why the uuid() function returns a String.

I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR.
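
The collision argument can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the implementation offered in this issue: it assumes the JDK's java.util.UUID.randomUUID() (random, version 4 UUIDs) instead of java-uuid-generator, and the object UuidRowIds with its helpers withIndexIds, withUuidIds and duplicateIds are hypothetical names introduced only for this example. Ordinary Seq values stand in for DataFrames.

```scala
import java.util.UUID

object UuidRowIds {
  // Hypothetical helper with the proposed signature. Assumption: it uses the
  // JDK's random (version 4) UUIDs rather than java-uuid-generator.
  def uuid(): String = UUID.randomUUID().toString

  // Index-based IDs, analogous to zipWithIndex(): every dataset starts at 0.
  def withIndexIds(rows: Seq[String]): Seq[(String, String)] =
    rows.zipWithIndex.map { case (row, i) => (i.toString, row) }

  // UUID-based IDs: each row gets an independently generated identifier.
  def withUuidIds(rows: Seq[String]): Seq[(String, String)] =
    rows.map(row => (uuid(), row))

  // Number of surplus IDs: total rows minus distinct IDs.
  def duplicateIds(rows: Seq[(String, String)]): Int =
    rows.size - rows.map(_._1).distinct.size

  def main(args: Array[String]): Unit = {
    val a = Seq("alice", "bob")
    val b = Seq("carol", "dave")

    // Concatenation (++) plays the role of union here.
    println(s"index-based duplicates: ${duplicateIds(withIndexIds(a) ++ withIndexIds(b))}")
    println(s"uuid-based duplicates:  ${duplicateIds(withUuidIds(a) ++ withUuidIds(b))}")

    // A UUID carries 128 bits, exposed by the JDK as two Long halves.
    val id = UUID.randomUUID()
    println(s"$id = (${id.getMostSignificantBits}, ${id.getLeastSignificantBits})")
  }
}
```

Because both index-based sequences start at 0, their IDs overlap after concatenation, while independently generated UUIDs are expected not to collide in practice. The String form is the canonical 36-character representation, which is what a String-returning uuid() would store.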