samredai commented on code in PR #4463:
URL: https://github.com/apache/iceberg/pull/4463#discussion_r864397904
##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
{{< /hint >}}
{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.
Review Comment:
```suggestion
Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against a partitioned table.
```
##########
docs/spark/spark-writes.md:
##########
@@ -376,17 +413,17 @@ Explicit registration of the function is necessary because Spark doesn't allow I
which can be used in query.
{{< /hint >}}
-Here we just registered the bucket function as `iceberg_bucket16`, which can be used in sort clause.
+Here the bucket function is registered as `iceberg_bucket16`, which can be used in sort clause.
Review Comment:
```suggestion
Here the bucket function is registered as `iceberg_bucket16`, which can be used in a sort clause.
```
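For context, a minimal sketch of how a function registered this way ends up in a sort clause, assuming a Spark shell session (`spark` in scope), the `IcebergSpark.registerBucketUDF` helper from the Iceberg Spark module, and a table partitioned by `bucket(16, id)`; table and column names follow the `prod.db.sample` example:

```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.types.DataTypes

// Register Iceberg's bucket transform as a Spark UDF named `iceberg_bucket16`
// (16 buckets over a LONG source column).
IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

// The registered function can now appear in a sort clause so that rows are
// clustered to match a bucket(16, id) partition spec before the write.
spark.sql("""
  INSERT INTO prod.db.sample
  SELECT id, data, category, ts FROM another_table
  ORDER BY iceberg_bucket16(id)
""")
```

`ORDER BY` gives the global-sort form here; `SORT BY iceberg_bucket16(id)` would be the local-sort equivalent.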
##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
PARTITIONED BY (days(ts), category)
```
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL
Review Comment:
```suggestion
#### Sort Order Using Spark SQL
```
##########
docs/spark/spark-writes.md:
##########
@@ -326,28 +330,61 @@ USING iceberg
PARTITIONED BY (days(ts), category)
```
-To write data to the sample table, your data needs to be sorted by `days(ts), category`.
+#### In Spark SQL
-If you're inserting data with SQL statement, you can use `ORDER BY` to achieve it, like below:
+To globally sort data based on `ts` and `category`:
```sql
INSERT INTO prod.db.sample
SELECT id, data, category, ts FROM another_table
ORDER BY ts, category
```
-If you're inserting data with DataFrame, you can use either `orderBy`/`sort` to trigger global sort, or `sortWithinPartitions`
-to trigger local sort. Local sort for example:
+To locally sort data based on `ts` and `category`:
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY ts, category
+```
+
+`SORT BY` clauses can also be used with partition transforms. The [date-and-timestamp-functions](https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#date-and-timestamp-functions) should be used when partition transforms are time related. Truncate related functions such as `substr` should be used when the partition transform is `truncate[W]`. It is required to [define and register UDFs](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) when the partition transform is [bucket transform](##Bucket Transform).
+
+```sql
+INSERT INTO prod.db.sample
+SELECT id, data, category, ts FROM another_table
+SORT BY day(ts), category
+```
+
+#### In the Dataframe API
Review Comment:
```suggestion
#### Sort Order Using the Dataframe API
```
This is so the item in the table of contents is more meaningful.
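As a rough illustration of what the renamed Dataframe API section might show, here is a sketch of a local sort aligned with the `days(ts), category` partitioning and the SQL `SORT BY day(ts), category` example in this hunk; `data` stands for the DataFrame being written and is an assumption, not the PR's exact snippet:

```scala
import org.apache.spark.sql.functions.{col, to_date}

// One possible DataFrame analogue of `SORT BY day(ts), category`: locally sort
// each Spark task's rows by the day of `ts` (matching the days(ts) partition
// transform) and then by category, then append with the DataFrameWriterV2 API.
data.sortWithinPartitions(to_date(col("ts")), col("category"))
    .writeTo("prod.db.sample")
    .append()
```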
##########
docs/spark/spark-writes.md:
##########
@@ -311,7 +311,11 @@ distribution & sort order to Spark.
{{< /hint >}}
{{< hint info >}}
-Both global sort (`orderBy`/`sort`) and local sort (`sortWithinPartitions`) work for the requirement.
+Both global sort (sorting all the data in the write) and local sort (sorting the data within a Spark task) can be used to write against partitioned table.
+
+In SQL, the [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.
Review Comment:
```suggestion
In SQL, [`ORDER BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html) will achieve global sorting and [`SORT BY`](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html) will achieve local sorting.
```
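To make the mapping concrete, a minimal sketch of the DataFrame counterparts referenced by the removed line, assuming `data` is the DataFrame being written to the sample table:

```scala
// orderBy/sort trigger a global sort (the analogue of ORDER BY), while
// sortWithinPartitions triggers a local sort (the analogue of SORT BY).
val globallySorted = data.orderBy("ts", "category")
val locallySorted  = data.sortWithinPartitions("ts", "category")

// Either frame is then written as usual, e.g.:
locallySorted.writeTo("prod.db.sample").append()
```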
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.