[GitHub] [iceberg] kbendick commented on a change in pull request #2101: Doc: add partition spec and sort order evolution doc

GitBox Sun, 17 Jan 2021 16:08:55 -0800


kbendick commented on a change in pull request #2101:
URL: https://github.com/apache/iceberg/pull/2101#discussion_r559259799




##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with 
an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to 
write queries for a specific partition layout to be fast. Instead, you can 
write queries that select the data you need, and Iceberg automatically prunes 
out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")

Review comment:
       I can't find a `removeField` overload that takes two string parameters 
in the `UpdatePartitionSpec` interface. Did you mean to call `renameField` 
here instead?

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with 
an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to 
write queries for a specific partition layout to be fast. Instead, you can 
write queries that select the data you need, and Iceberg automatically prunes 
out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")

Review comment:
       What does this partition spec update do? It looks like a no-op for a 
field rename (other than possibly reassigning IDs for this column, though that 
is just a guess).
   
   I can appreciate that this is allowed, but unless the call does something 
I'm not aware of, including it in the documentation's example is likely to 
confuse people who are learning more than it helps. Otherwise, as I mentioned 
elsewhere, it would be good to write out what this `updateSpec` call does, 
since it isn't immediately self-evident, at least to me.

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with 
an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to 
write queries for a specific partition layout to be fast. Instead, you can 
write queries that select the data you need, and Iceberg automatically prunes 
out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```

Review comment:
       Something to consider:
   
   You might state what this `updateSpec` code is going to do, with something 
like "For example, the following code updates the partition spec to bucket the 
`id` column into 8 buckets...".
   
   Additionally, I think it would be helpful to indicate whether or not this 
changes the old data.
   
   Your added `Sort order evolution` docs say "When you evolve a sort order, 
the old data written with an earlier order remains unchanged." To me, that 
raises the question of whether updating the partition spec via `updateSpec` 
rewrites old data and, if it does not, what precautions we recommend to people 
who use it.
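   
   To make the question concrete, here is a minimal sketch of the behavior I 
would expect the docs to spell out. It is a toy model, not Iceberg code: 
`ToyTable`, `ToyFile`, and `evolveSpec` are hypothetical names, and the spec 
strings are placeholders. The one assumption it encodes comes from the existing 
doc line "Partition evolution is a metadata operation and does not eagerly 
rewrite files", i.e. each file keeps the spec it was written with.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT part of the Iceberg API.
final class PartitionEvolutionSketch {

    // A data file remembers the ID of the partition spec it was written with.
    static final class ToyFile {
        final String path;
        final int specId;

        ToyFile(String path, int specId) {
            this.path = path;
            this.specId = specId;
        }
    }

    static final class ToyTable {
        private final List<String> specs = new ArrayList<>(); // list index == spec ID
        private final List<ToyFile> files = new ArrayList<>();

        ToyTable(String initialSpec) {
            specs.add(initialSpec);
        }

        int currentSpecId() {
            return specs.size() - 1;
        }

        // Evolving the spec only appends metadata; existing files are not touched.
        void evolveSpec(String newSpec) {
            specs.add(newSpec);
        }

        // New writes always use the current spec.
        void append(String path) {
            files.add(new ToyFile(path, currentSpecId()));
        }

        // Looks up the spec a given file was written with.
        String specOf(String path) {
            for (ToyFile f : files) {
                if (f.path.equals(path)) {
                    return specs.get(f.specId);
                }
            }
            throw new IllegalArgumentException("unknown file: " + path);
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable("category");
        table.append("a.parquet");                    // written under the original spec
        table.evolveSpec("category, bucket(8, id)");  // metadata-only change
        table.append("b.parquet");                    // written under the new spec

        System.out.println("a.parquet -> " + table.specOf("a.parquet"));
        System.out.println("b.parquet -> " + table.specOf("b.parquet"));
    }
}
```

   In this sketch the file written before the change still reports the old 
spec, which is the property I think the new paragraph should state explicitly 
for `updateSpec`.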

##########
File path: site/docs/evolution.md
##########
@@ -62,3 +62,31 @@ When you evolve a partition spec, the old data written with 
an earlier spec rema
 Iceberg uses [hidden partitioning](./partitioning.md), so you don't *need* to 
write queries for a specific partition layout to be fast. Instead, you can 
write queries that select the data you need, and Iceberg automatically prunes 
out files that don't contain matching data.
 
 Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides `updateSpec` API to update partition spec. 
For example:
+
+```java
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .renameField("category", "category")
+    .removeField("id_bucket_8", "shard")
+    .commit();
+```
+
+Spark supports updating partition spec through its `ALTER TABLE` SQL 
statement, see more details in [Spark 
SQL](../spark/#alter-table-add-partition-field).
+
+## Sort order evolution
+
+Similar to partition spec, Iceberg sort order can also be updated in an 
existing table.
+When you evolve a sort order, the old data written with an earlier order 
remains unchanged.
+Engines can always choose to write data in the latest sort order or unsorted 
when sorting is prohibitively expensive.

Review comment:
       When a table has a sort order spec but the older data is not sorted 
according to it, can this cause queries to silently return incorrect data? Or 
is this not an issue, given that engines can already choose whether to write 
data sorted based on how expensive sorting is deemed to be?
   
   This may be explained in more detail elsewhere, but otherwise I think it 
would be good to clarify whether changes to the sort order can cause incorrect 
query results (e.g. if the query engine assumes during execution planning that 
the data is sorted).
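   
   As an illustration of the concern, here is a small sketch of the check I 
would expect a planner to need before relying on ordering. It is a toy model 
with hypothetical names (`ToyFile`, `canAssumeSorted`), not Iceberg code, and 
it assumes that each file records the ID of the sort order it was written with 
and that ID 0 means "unsorted".

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, NOT the Iceberg API.
final class SortOrderSketch {

    // Assumption for this sketch: ID 0 stands for "no sort order".
    static final int UNSORTED = 0;

    // A data file remembers the ID of the sort order it was written with.
    static final class ToyFile {
        final String path;
        final int sortOrderId;

        ToyFile(String path, int sortOrderId) {
            this.path = path;
            this.sortOrderId = sortOrderId;
        }
    }

    // A planner may rely on per-file ordering only if every file it reads
    // was written with the sort order it expects.
    static boolean canAssumeSorted(List<ToyFile> files, int expectedOrderId) {
        if (expectedOrderId == UNSORTED) {
            return false;
        }
        for (ToyFile f : files) {
            if (f.sortOrderId != expectedOrderId) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // One file predates the sort order change, one was written after it.
        List<ToyFile> mixed = Arrays.asList(
            new ToyFile("old.parquet", 1),
            new ToyFile("new.parquet", 2));

        System.out.println("can assume order 2: " + canAssumeSorted(mixed, 2));
    }
}
```

   If engines are expected to make a check like this (or never to assume 
ordering at all), saying so in the docs would answer the question above.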






