Re: [PR] refactor: consolidate snapshot expiration into MaintenanceTable [iceberg-python]

via GitHub Fri, 08 Aug 2025 06:16:38 -0700


ForeverAngry commented on code in PR #2143:
URL: https://github.com/apache/iceberg-python/pull/2143#discussion_r2262828062



##########
mkdocs/docs/api.md:
##########
@@ -995,513 +995,114 @@ readable_metrics: [
 [6.0989]]
 ```
 
-!!! info
-    Content refers to type of content stored by the data file: `0` - `Data`, `1` - `Position Deletes`, `2` - `Equality Deletes`
+## Table Maintenance
 
-To show only data files or delete files in the current snapshot, use `table.inspect.data_files()` and `table.inspect.delete_files()` respectively.
+PyIceberg provides a set of maintenance utilities to help keep your tables healthy, efficient, and resilient. These operations are available via the `MaintenanceTable` class and are essential for managing metadata, reclaiming space, and ensuring operational safety.
 
-## Add Files
+### Use Cases
 
-Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.
+- **Deduplicate Data Files**: Remove duplicate references to the same physical data file, which can occur due to concurrent writes, manual file additions, or recovery from failures.
+- **Snapshot Retention**: Control the number and age of snapshots retained for rollback, auditing, and space management.
+- **Safe Expiration**: Ensure that protected snapshots (e.g., branch/tag heads) are never accidentally removed.
 
-```python
-# Given that these parquet files have schema consistent with the Iceberg table
-
-file_paths = [
-    "s3a://warehouse/default/existing-1.parquet",
-    "s3a://warehouse/default/existing-2.parquet",
-]
-
-# They can be added to the table without rewriting them
-
-tbl.add_files(file_paths=file_paths)
-
-# A new snapshot is committed to the table with manifests pointing to the existing parquet files
-```
-
-<!-- prettier-ignore-start -->
-
-!!! note "Name Mapping"
-    Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
-
-!!! note "Partitions"
-    `add_files` only requires the client to read the existing parquet files' metadata footer to infer the partition value of each file. This implementation also supports adding files to Iceberg tables with partition transforms like `MonthTransform`, and `TruncateTransform` which preserve the order of the values after the transformation (Any Transform that has the `preserves_order` property set to True is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`.
-
-!!! warning "Maintenance Operations"
-    Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them.
-
-<!-- prettier-ignore-end -->
-
-## Schema evolution
-
-PyIceberg supports full schema evolution through the Python API. It takes care of setting the field-IDs and makes sure that only non-breaking changes are done (can be overridden).
-
-In the examples below, the `.update_schema()` is called from the table itself.
-
-```python
-with table.update_schema() as update:
-    update.add_column("some_field", IntegerType(), "doc")
-```
-
-You can also initiate a transaction if you want to make more changes than just evolving the schema:
-
-```python
-with table.transaction() as transaction:
-    with transaction.update_schema() as update_schema:
-        update.add_column("some_other_field", IntegerType(), "doc")
-    # ... Update properties etc
-```
-
-### Union by Name
-
-Using `.union_by_name()` you can merge another schema into an existing schema without having to worry about field-IDs:
-
-```python
-from pyiceberg.catalog import load_catalog
-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType, LongType
-
-catalog = load_catalog()
-
-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-table = catalog.create_table("default.locations", schema)
-
-new_schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-    NestedField(10, "population", LongType(), required=False),
-)
-
-with table.update_schema() as update:
-    update.union_by_name(new_schema)
-```
-
-Now the table has the union of the two schemas `print(table.schema())`:
-
-```python
-table {
-  1: city: optional string
-  2: lat: optional double
-  3: long: optional double
-  4: population: optional long
-}
-```
-
-### Add column
-
-Using `add_column` you can add a column, without having to worry about the field-id:
-
-```python
-with table.update_schema() as update:
-    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
-    # In a struct
-    update.add_column("details", StructType())
-
-with table.update_schema() as update:
-    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
-```
-
-A complex type must exist before columns can be added to it. Fields in complex types are added in a tuple.
-
-### Rename column
-
-Renaming a field in an Iceberg table is simple:
-
-```python
-with table.update_schema() as update:
-    update.rename_column("retries", "num_retries")
-    # This will rename `confirmed_by` to `processed_by` in the `details` struct
-    update.rename_column(("details", "confirmed_by"), "processed_by")
-```
-
-### Move column
-
-Move order of fields:
-
-```python
-with table.update_schema() as update:
-    update.move_first("symbol")
-    # This will move `bid` after `ask`
-    update.move_after("bid", "ask")
-    # This will move `confirmed_by` before `exchange` in the `details` struct
-    update.move_before(("details", "confirmed_by"), ("details", "exchange"))
-```
-
-### Update column
-
-Update a fields' type, description or required.
-
-```python
-with table.update_schema() as update:
-    # Promote a float to a double
-    update.update_column("bid", field_type=DoubleType())
-    # Make a field optional
-    update.update_column("symbol", required=False)
-    # Update the documentation
-    update.update_column("symbol", doc="Name of the share on the exchange")
-```
-
-Be careful, some operations are not compatible, but can still be done at your own risk by setting `allow_incompatible_changes`:
-
-```python
-with table.update_schema(allow_incompatible_changes=True) as update:
-    # Incompatible change, cannot require an optional field
-    update.update_column("symbol", required=True)
-```
-
-### Delete column
-
-Delete a field, careful this is a incompatible change (readers/writers might expect this field):
-
-```python
-with table.update_schema(allow_incompatible_changes=True) as update:
-    update.delete_column("some_field")
-    # In a struct
-    update.delete_column(("details", "confirmed_by"))
-```
-
-## Partition evolution
-
-PyIceberg supports partition evolution. See the [partition evolution](https://iceberg.apache.org/spec/#partition-evolution)
-for more details.
-
-The API to use when evolving partitions is the `update_spec` API on the table.
-
-```python
-with table.update_spec() as update:
-    update.add_field("id", BucketTransform(16), "bucketed_id")
-    update.add_field("event_ts", DayTransform(), "day_ts")
-```
-
-Updating the partition spec can also be done as part of a transaction with other operations.
-
-```python
-with table.transaction() as transaction:
-    with transaction.update_spec() as update_spec:
-        update_spec.add_field("id", BucketTransform(16), "bucketed_id")
-        update_spec.add_field("event_ts", DayTransform(), "day_ts")
-    # ... Update properties etc
-```
-
-### Add fields

Review Comment:
   Oops, I'll fix that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


