It looks like you want to add a column such that ordering by it
preserves the current order of the DataFrame.

All you need is the monotonically_increasing_id() function:

from pyspark.sql.functions import monotonically_increasing_id

spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id()).show()
+---+-----------+
| id|        row|
+---+-----------+
|  0|          0|
|  1|          1|
|  2| 8589934592|
|  3| 8589934593|
|  4|17179869184|
|  5|17179869185|
|  6|25769803776|
|  7|25769803777|
|  8|34359738368|
|  9|34359738369|
+---+-----------+

Within a partition, all rows have consecutive row numbers; at the start
of each new partition the row number jumps. The example above has 5
partitions with 2 rows each.
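
To make the jumps less mysterious, here is a small plain-Python sketch (no Spark needed) that splits the ids from the output above into a partition index and a position within the partition. It assumes the layout described in the Spark documentation for monotonically_increasing_id(): partition index in the upper 31 bits, record number in the lower 33 bits. The helper name split_id is made up for this illustration, and the layout is an implementation detail that should not be relied on beyond ordering.

```python
# Assumption: ids follow the documented layout of monotonically_increasing_id():
# partition index in the upper 31 bits, record number in the lower 33 bits.
RECORD_BITS = 33


def split_id(row_id: int) -> tuple:
    """Return (partition index, position within that partition). Hypothetical helper."""
    partition = row_id >> RECORD_BITS
    position = row_id & ((1 << RECORD_BITS) - 1)
    return partition, position


# The "row" values from the example output above.
ids = [
    0, 1,
    8589934592, 8589934593,
    17179869184, 17179869185,
    25769803776, 25769803777,
    34359738368, 34359738369,
]

for row_id in ids:
    partition, position = split_id(row_id)
    print(f"{row_id:>11}  partition={partition}  position={position}")
```

Each jump is a multiple of 2**33 = 8589934592, which is why sorting by this column keeps partitions, and the rows inside them, in their current order.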

In case you need a global consecutive row number (not needed to preserve
the current DataFrame order, which is what you asked for), you can use
the DataFrame.with_row_numbers() method provided by the Spark-Extension
package:
https://github.com/G-Research/spark-extension/blob/master/ROW_NUMBER.md

import gresearch.spark

df.with_row_numbers().show()
+---+----------+
| id|row_number|
+---+----------+
|  1|         1|
|  2|         2|
|  2|         3|
|  3|         4|
+---+----------+

Cheers,
Enrico



On 05.12.23 at 18:25, Som Lima wrote:

I want to maintain the order of the rows in the data frame in PySpark.
Is there any way to achieve this? The function below adds a row ID
column that numbers each row, but currently it rearranges the rows of
the data frame.

def createRowIdColumn(new_column, position, start_value):
    row_count = df.count()
    row_ids = spark.range(int(start_value), int(start_value) + row_count, 1).toDF(new_column)
    window = Window.orderBy(lit(1))
    df_row_ids = row_ids.withColumn("row_num", row_number().over(window) - 1)
    df_with_row_num = df.withColumn("row_num", row_number().over(window) - 1)
    if position == "Last Column":
        result = df_with_row_num.join(df_row_ids, on="row_num").drop("row_num")
    else:
        result = df_row_ids.join(df_with_row_num, on="row_num").drop("row_num")
    return result.orderBy(new_column)

Please let me know the solution if we can achieve this requirement.

