Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/21858#discussion_r206643036
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
/**
* A column expression that generates monotonically increasing 64-bit integers.
*
- * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+ * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+ * consecutive (unless all rows are in the same single partition which you rarely want due to
+ * the volume of the data).
* The current implementation puts the partition ID in the upper 31 bits, and the record number
* within each partition in the lower 33 bits. The assumption is that the data frame has
* less than 1 billion partitions, and each partition has less than 8 billion records.
*
- * As an example, consider a `DataFrame` with two partitions, each with 3 records.
- * This expression would return the following IDs:
- *
* {{{
- * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+ * // Create a dataset with four partitions, each with two rows.
+ * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+ *
+ * // Make sure that every partition has the same number of rows
+ * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+ * q.select(monotonically_increasing_id).show
--- End diff --
I personally would simplify the example to not focus on the particular
shift; yeah that behavior ought not change but it's not really something a
caller would ever rely on. And I think you don't need to make a new variable to
subtract 1 from row number, etc. Something simply showing the two properties --
increasing within partition, not between partitions -- is enough.
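
For reference, the bit layout the Scaladoc describes can be illustrated without Spark at all. This is a minimal plain-Scala sketch under that description (the `monotonicId` helper is hypothetical; only the 31/33-bit split and the example values come from the Scaladoc itself):

```scala
// Hypothetical helper mirroring the documented layout: partition ID in the
// upper 31 bits, per-partition record number in the lower 33 bits.
def monotonicId(partitionId: Int, recordNumber: Long): Long =
  (partitionId.toLong << 33) | recordNumber

// Two partitions with three records each, matching the original Scaladoc example.
val ids = for {
  partition <- 0 until 2
  record    <- 0L until 3L
} yield monotonicId(partition, record)

println(ids.mkString(", "))
// 0, 1, 2, 8589934592, 8589934593, 8589934594
```

This makes both properties visible: IDs increase by 1 within a partition, while the jump from 2 to 8589934592 (1L << 33) shows they are not consecutive across partitions.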
---