HeartSaVioR commented on a change in pull request #1523:
URL: https://github.com/apache/iceberg/pull/1523#discussion_r496339735
##########
File path: site/docs/spark.md
##########
@@ -519,6 +519,59 @@ data.writeTo("prod.db.table")
.createOrReplace()
```
+## Writing against partitioned table
+
+Iceberg requires the data to be sorted according to the partition spec
+prior to writing against a partitioned table.
Review comment:
Please correct me if I'm missing something. (Sorry for being overly pedantic
technically; I'm also still learning Iceberg, so I just want to understand this correctly.)
If I understand correctly, the Iceberg Spark writer requires the data to be
sorted according to the partition spec within each task (Spark partition),
not merely clustered by partition value.
The query below fails:
```scala
// Assumes a spark-shell session: spark.implicits._ for toDF, and
// org.apache.spark.sql.functions.col for the repartition expression.
import org.apache.spark.sql.functions.col
import spark.implicits._

spark.sql("""
  CREATE TABLE iceberg_catalog.default.sample1 (
    id bigint,
    data string,
    category string)
  USING iceberg
  PARTITIONED BY (category)
""")

val data = (0 to 100000).map { id =>
  (id, s"hello$id", s"category-${id % 100}")
}

// Clustered by the partition column, but sorted within each task by id
// rather than by category: the write below fails.
data.toDF("id", "data", "category")
  .repartition(100, col("category"))
  .sortWithinPartitions("id")
  .writeTo("iceberg_catalog.default.sample1")
  .append()
```
Mentioning the global and local sort options here would be a nice addition.
Thanks! Will add.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]