(iceberg) branch main updated: Docs: Add schema selection example for time travel queries (#14825)

kevinjqliu Tue, 16 Dec 2025 08:33:53 -0800

This is an automated email from the ASF dual-hosted git repository.

kevinjqliu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git



The following commit(s) were added to refs/heads/main by this push:
     new d5bfcaf7ca Docs: Add schema selection example for time travel queries 
(#14825)
d5bfcaf7ca is described below

commit d5bfcaf7ca518d70a4cb47167c948a4a367df9e3
Author: Vamsi Krishna <[email protected]>
AuthorDate: Tue Dec 16 22:03:38 2025 +0530

    Docs: Add schema selection example for time travel queries (#14825)
---
 docs/docs/spark-queries.md | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/docs/docs/spark-queries.md b/docs/docs/spark-queries.md
index 41189d05ff..ce626f42b1 100644
--- a/docs/docs/spark-queries.md
+++ b/docs/docs/spark-queries.md
@@ -116,6 +116,80 @@ SELECT * FROM prod.db.table VERSION AS OF 
'historical-snapshot';
 SELECT * FROM prod.db.table.`tag_historical-snapshot`;
 ```
 
+For example, consider a table that evolves its schema over time, and see how 
each type of time travel query selects its schema:
+
+```sql
+-- snapshot S1: initial schema (id, status)
+CREATE TABLE prod.db.orders (
+  id BIGINT,
+  status STRING
+) USING iceberg;
+
+INSERT INTO prod.db.orders VALUES (1, 'NEW'), (2, 'PAID');
+
+-- record snapshot S1's snapshot_id and committed_at timestamp
+-- e.g. snapshot_id = 101, committed_at = '2025-01-01 10:00:00'
+
+-- snapshot S2: add a new column "total" and write new data
+ALTER TABLE prod.db.orders ADD COLUMN total DOUBLE;
+
+INSERT INTO prod.db.orders VALUES (3, 'PAID', 100.0);
+
+-- now S2 is the current snapshot with schema (id, status, total)
+```
+
+Time travel queries that select a specific snapshot or timestamp use the
+snapshot's schema:
+
+```sql
+-- uses the snapshot schema of S1: columns (id, status)
+SELECT * FROM prod.db.orders VERSION AS OF 101;
+
+SELECT * FROM prod.db.orders TIMESTAMP AS OF '2025-01-01 10:00:00';
+```
+
+In both queries above, the result only has `id` and `status`. The `total`
+column does not exist in the S1 schema and is not visible, even though the
+current table schema includes `total`.
+
+Now create a branch and a tag that both reference S1:
+
+```sql
+-- branch "audit_branch" points to snapshot S1
+ALTER TABLE prod.db.orders CREATE BRANCH audit_branch AS OF VERSION 101;
+
+-- tag "first_load" also points to snapshot S1
+ALTER TABLE prod.db.orders CREATE TAG first_load AS OF VERSION 101;
+```
+
+When you query a branch, Spark uses the table's current schema:
+
+```sql
+-- uses the table schema: columns (id, status, total)
+SELECT * FROM prod.db.orders VERSION AS OF 'audit_branch';
+
+-- equivalent identifier form
+SELECT * FROM prod.db.orders.`branch_audit_branch`;
+```
+
+In these queries, the result has columns `(id, status, total)`. For the rows
+from S1, `total` is returned as `NULL` because that column did not exist when
+those rows were written.
+
+When you query a tag, Spark uses the snapshot's schema referenced by the tag:
+
+```sql
+-- uses the snapshot schema of S1: columns (id, status)
+SELECT * FROM prod.db.orders VERSION AS OF 'first_load';
+
+-- equivalent identifier form
+SELECT * FROM prod.db.orders.`tag_first_load`;
+```
+
+These queries only return `id` and `status`, because tags are bound to a
+specific snapshot and use that snapshot's schema, even if the table's current
+schema has evolved.
+
 ## Querying with DataFrames
 
 To load a table as a DataFrame, use `table`:

(iceberg) branch main updated: Docs: Add schema selection example for time travel queries (#14825)

Reply via email to