ForeverAngry commented on code in PR #2143:
URL: https://github.com/apache/iceberg-python/pull/2143#discussion_r2262828062
##########
mkdocs/docs/api.md:
##########
@@ -995,513 +995,114 @@ readable_metrics: [
 [6.0989]]
 ```
-!!! info
-    Content refers to type of content stored by the data file: `0` - `Data`, `1` - `Position Deletes`, `2` - `Equality Deletes`
+## Table Maintenance
-To show only data files or delete files in the current snapshot, use `table.inspect.data_files()` and `table.inspect.delete_files()` respectively.
+PyIceberg provides a set of maintenance utilities to help keep your tables healthy, efficient, and resilient. These operations are available via the `MaintenanceTable` class and are essential for managing metadata, reclaiming space, and ensuring operational safety.
-## Add Files
+### Use Cases
-Expert Iceberg users may choose to commit existing parquet files to the Iceberg table as data files, without rewriting them.
+- **Deduplicate Data Files**: Remove duplicate references to the same physical data file, which can occur due to concurrent writes, manual file additions, or recovery from failures.
+- **Snapshot Retention**: Control the number and age of snapshots retained for rollback, auditing, and space management.
+- **Safe Expiration**: Ensure that protected snapshots (e.g., branch/tag heads) are never accidentally removed.
-```python
-# Given that these parquet files have schema consistent with the Iceberg table
-
-file_paths = [
-    "s3a://warehouse/default/existing-1.parquet",
-    "s3a://warehouse/default/existing-2.parquet",
-]
-
-# They can be added to the table without rewriting them
-
-tbl.add_files(file_paths=file_paths)
-
-# A new snapshot is committed to the table with manifests pointing to the existing parquet files
-```
-
-<!-- prettier-ignore-start -->
-
-!!! note "Name Mapping"
-    Because `add_files` uses existing files without writing new parquet files that are aware of the Iceberg's schema, it requires the Iceberg's table to have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) (The Name mapping maps the field names within the parquet files to the Iceberg field IDs). Hence, `add_files` requires that there are no field IDs in the parquet file's metadata, and creates a new Name Mapping based on the table's current schema if the table doesn't already have one.
-
-!!! note "Partitions"
-    `add_files` only requires the client to read the existing parquet files' metadata footer to infer the partition value of each file. This implementation also supports adding files to Iceberg tables with partition transforms like `MonthTransform`, and `TruncateTransform` which preserve the order of the values after the transformation (Any Transform that has the `preserves_order` property set to True is supported). Please note that if the column statistics of the `PartitionField`'s source column are not present in the parquet metadata, the partition value is inferred as `None`.
-
-!!! warning "Maintenance Operations"
-    Because `add_files` commits the existing parquet files to the Iceberg Table as any other data file, destructive maintenance operations like expiring snapshots will remove them.
-
-<!-- prettier-ignore-end -->
-
-## Schema evolution
-
-PyIceberg supports full schema evolution through the Python API. It takes care of setting the field-IDs and makes sure that only non-breaking changes are done (can be overridden).
-
-In the examples below, the `.update_schema()` is called from the table itself.
-
-```python
-with table.update_schema() as update:
-    update.add_column("some_field", IntegerType(), "doc")
-```
-
-You can also initiate a transaction if you want to make more changes than just evolving the schema:
-
-```python
-with table.transaction() as transaction:
-    with transaction.update_schema() as update_schema:
-        update.add_column("some_other_field", IntegerType(), "doc")
-    # ... Update properties etc
-```
-
-### Union by Name
-
-Using `.union_by_name()` you can merge another schema into an existing schema without having to worry about field-IDs:
-
-```python
-from pyiceberg.catalog import load_catalog
-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType, LongType
-
-catalog = load_catalog()
-
-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-table = catalog.create_table("default.locations", schema)
-
-new_schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-    NestedField(10, "population", LongType(), required=False),
-)
-
-with table.update_schema() as update:
-    update.union_by_name(new_schema)
-```
-
-Now the table has the union of the two schemas `print(table.schema())`:
-
-```python
-table {
-  1: city: optional string
-  2: lat: optional double
-  3: long: optional double
-  4: population: optional long
-}
-```
-
-### Add column
-
-Using `add_column` you can add a column, without having to worry about the field-id:
-
-```python
-with table.update_schema() as update:
-    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
-    # In a struct
-    update.add_column("details", StructType())
-
-with table.update_schema() as update:
-    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")
-```
-
-A complex type must exist before columns can be added to it. Fields in complex types are added in a tuple.
-
-### Rename column
-
-Renaming a field in an Iceberg table is simple:
-
-```python
-with table.update_schema() as update:
-    update.rename_column("retries", "num_retries")
-    # This will rename `confirmed_by` to `processed_by` in the `details` struct
-    update.rename_column(("details", "confirmed_by"), "processed_by")
-```
-
-### Move column
-
-Move order of fields:
-
-```python
-with table.update_schema() as update:
-    update.move_first("symbol")
-    # This will move `bid` after `ask`
-    update.move_after("bid", "ask")
-    # This will move `confirmed_by` before `exchange` in the `details` struct
-    update.move_before(("details", "confirmed_by"), ("details", "exchange"))
-```
-
-### Update column
-
-Update a fields' type, description or required.
-
-```python
-with table.update_schema() as update:
-    # Promote a float to a double
-    update.update_column("bid", field_type=DoubleType())
-    # Make a field optional
-    update.update_column("symbol", required=False)
-    # Update the documentation
-    update.update_column("symbol", doc="Name of the share on the exchange")
-```
-
-Be careful, some operations are not compatible, but can still be done at your own risk by setting `allow_incompatible_changes`:
-
-```python
-with table.update_schema(allow_incompatible_changes=True) as update:
-    # Incompatible change, cannot require an optional field
-    update.update_column("symbol", required=True)
-```
-
-### Delete column
-
-Delete a field, careful this is a incompatible change (readers/writers might expect this field):
-
-```python
-with table.update_schema(allow_incompatible_changes=True) as update:
-    update.delete_column("some_field")
-    # In a struct
-    update.delete_column(("details", "confirmed_by"))
-```
-
-## Partition evolution
-
-PyIceberg supports partition evolution. See the [partition evolution](https://iceberg.apache.org/spec/#partition-evolution)
-for more details.
-
-The API to use when evolving partitions is the `update_spec` API on the table.
-
-```python
-with table.update_spec() as update:
-    update.add_field("id", BucketTransform(16), "bucketed_id")
-    update.add_field("event_ts", DayTransform(), "day_ts")
-```
-
-Updating the partition spec can also be done as part of a transaction with other operations.
-
-```python
-with table.transaction() as transaction:
-    with transaction.update_spec() as update_spec:
-        update_spec.add_field("id", BucketTransform(16), "bucketed_id")
-        update_spec.add_field("event_ts", DayTransform(), "day_ts")
-    # ... Update properties etc
-```
-
-### Add fields

Review Comment:
   Oops, I'll fix that.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org