samredai commented on code in PR #4463:
URL: https://github.com/apache/iceberg/pull/4463#discussion_r864397904
##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
{{< /hint >}}
{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.
Review Comment:
```suggestion
Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against a partitioned table.
```
##########
docs/spark/spark-writes.md:
##########
@@ -376,17 +413,17 @@ Explicit registration of the function is necessary because Spark doesn't allow I
which can be used in query.
{{< /hint >}}
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
+Here the bucket function is registered as `iceberg_bucket16`, which can be used in sort clause.
Review Comment:
```suggestion
Here the bucket function is registered as `iceberg_bucket16`, which can be used in a sort clause.
```
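For context, a minimal sketch of how a function registered this way ends up in a sort clause, assuming a Spark shell session (`spark` in scope), the `IcebergSpark.registerBucketUDF` helper from the Iceberg Spark module, and a table partitioned by `bucket(16, id)`; table and column names follow the `prod.db.sample` example:

```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.types.DataTypes

// Register Iceberg's bucket transform as a Spark UDF named `iceberg_bucket16`
// (16 buckets over a LONG source column).
IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

// The registered function can now appear in a sort clause so that rows are
// clustered to match a bucket(16, id) partition spec before the write.
spark.sql("""
  INSERT INTO prod.db.sample
  SELECT id, data, category, ts FROM another_table
  ORDER BY iceberg_bucket16(id)
""")
```

`ORDER BY` gives the global-sort form here; `SORT BY iceberg_bucket16(id)` would be the local-sort equivalent.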
##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
PARTITIONED BY (days(ts), category)
```
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL
Review Comment:
```suggestion
#### Sort Order Using Spark SQL
```
##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
PARTITIONED BY (days(ts), category)
```
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To globally sort data based on `ts` and `category`:
```sql
INSERT INTO prod.db.sample
SELECT id, data, category, ts FROM another_table
ORDER BY ts, category
```
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
+To locally sort data based on `ts` and `category`:
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY ts, category
+```
+
+`SORT BY` clauses can also be used with partition transforms. The [date-and-timestamp-functions](https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#date-and-timestamp-functions) should be used when partition transforms are time related. Truncate related functions such as `substr` should be used when the partition transform is `truncate[W]`. It is required to [define and register UDFs](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) when the partition transform is [bucket transform](##Bucket Transform).
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY day(ts), category
+```
+
+#### In the Dataframe API
Review Comment:
```suggestion
#### Sort Order Using the Dataframe API
```
This is so the item in the table of contents is more meaningful.
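As a rough illustration of what the renamed Dataframe API section might show, here is a sketch of a local sort aligned with the `days(ts), category` partitioning and the SQL `SORT BY day(ts), category` example in this hunk; `data` stands for the DataFrame being written and is an assumption, not the PR's exact snippet:

```scala
import org.apache.spark.sql.functions.{col, to_date}

// One possible DataFrame analogue of `SORT BY day(ts), category`: locally sort
// each Spark task's rows by the day of `ts` (matching the days(ts) partition
// transform) and then by category, then append with the DataFrameWriterV2 API.
data.sortWithinPartitions(to_date(col("ts")), col("category"))
    .writeTo("prod.db.sample")
    .append()
```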
##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
{{< /hint >}}
{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.
+
+In SQL, the [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.
Review Comment:
```suggestion
In SQL, [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.
```
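To make the mapping concrete, a minimal sketch of the DataFrame counterparts referenced by the removed line, assuming `data` is the DataFrame being written to the sample table:

```scala
// orderBy/sort trigger a global sort (the analogue of ORDER BY), while
// sortWithinPartitions triggers a local sort (the analogue of SORT BY).
val globallySorted = data.orderBy("ts", "category")
val locallySorted  = data.sortWithinPartitions("ts", "category")

// Either frame is then written as usual, e.g.:
locallySorted.writeTo("prod.db.sample").append()
```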
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.