Re: [PR] docs: clarify Parameters for the add_files API [iceberg-python]

via GitHub Mon, 28 Jul 2025 22:06:09 -0700


kevinjqliu commented on code in PR #2249:
URL: https://github.com/apache/iceberg-python/pull/2249#discussion_r2238530443



##########
mkdocs/docs/api.md:
##########
@@ -1004,6 +1004,33 @@ To show only data files or delete files in the current 
snapshot, use `table.insp
 
 Expert Iceberg users may choose to commit existing parquet files to the 
Iceberg table as data files, without rewriting them.
 
+<!-- prettier-ignore-start -->
+
+!!! note "Name Mapping"
+Because `add_files` uses existing files without writing new parquet files that 
are aware of the Iceberg's schema, it requires the Iceberg's table to have a 
[Name 
Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization)
 (The Name mapping maps the field names within the parquet files to the Iceberg 
field IDs). Hence, `add_files` requires that there are no field IDs in the 
parquet file's metadata, and creates a new Name Mapping based on the table's 
current schema if the table doesn't already have one.

Review Comment:
   All the paragraphs under `!!!` are missing the 4-space indent they had 
originally.
   
   The indent is needed for these boxes to render correctly:
   ![Screenshot 2025-07-28 at 10 01 46 
PM](https://github.com/user-attachments/assets/a25d571c-137b-4755-a3fe-89b8d7863fb7)
   
   Could you add them back?
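
As a rough illustration of the indentation rule mentioned in this comment, the sketch below flags `!!!` admonition headers whose body is not indented by 4 spaces. The function name `find_unindented_admonitions` is hypothetical and is not part of mkdocs or this repository; it only checks the first non-blank line after each header.

```python
def find_unindented_admonitions(markdown_text: str) -> list[int]:
    """Return 1-based line numbers of `!!!` headers whose body lacks a 4-space indent.

    Hypothetical helper for illustration only; it inspects just the first
    non-blank line following each admonition header.
    """
    lines = markdown_text.splitlines()
    flagged = []
    for index, line in enumerate(lines):
        if not line.startswith("!!!"):
            continue
        # First non-blank line after the header is treated as the body.
        body = next((l for l in lines[index + 1:] if l.strip()), None)
        if body is None or not body.startswith("    "):
            flagged.append(index + 1)
    return flagged


indented = '!!! note "Name Mapping"\n    Body text.\n'
unindented = '!!! note "Name Mapping"\nBody text.\n'
print(find_unindented_admonitions(indented))
print(find_unindented_admonitions(unindented))
```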



##########
mkdocs/docs/api.md:
##########
@@ -1004,6 +1004,33 @@ To show only data files or delete files in the current 
snapshot, use `table.insp
 
 Expert Iceberg users may choose to commit existing parquet files to the 
Iceberg table as data files, without rewriting them.
 
+<!-- prettier-ignore-start -->
+
+!!! note "Name Mapping"
+Because `add_files` uses existing files without writing new parquet files that 
are aware of the Iceberg's schema, it requires the Iceberg's table to have a 
[Name 
Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization)
 (The Name mapping maps the field names within the parquet files to the Iceberg 
field IDs). Hence, `add_files` requires that there are no field IDs in the 
parquet file's metadata, and creates a new Name Mapping based on the table's 
current schema if the table doesn't already have one.
+
+!!! note "Partitions"
+`add_files` only requires the client to read the existing parquet files' 
metadata footer to infer the partition value of each file. This implementation 
also supports adding files to Iceberg tables with partition transforms like 
`MonthTransform`, and `TruncateTransform` which preserve the order of the 
values after the transformation (Any Transform that has the `preserves_order` 
property set to True is supported). Please note that if the column statistics 
of the `PartitionField`'s source column are not present in the parquet 
metadata, the partition value is inferred as `None`.
+
+!!! warning "Maintenance Operations"
+Because `add_files` commits the existing parquet files to the Iceberg Table as 
any other data file, destructive maintenance operations like expiring snapshots 
will remove them.
+
+!!! warning "Check Duplicate Files"
+The `check_duplicate_files` parameter is `True` by default and will check the 
new files against the existing Iceberg table data files to prevent duplicates. 
This check can be expensive for large tables with many files. It is recommended 
to use the default configuration. The check can be turned off by setting 
`check_duplicate_files=False`, but this may result in duplicate files being 
added to the table, which can lead to data consistency issues and potential 
table corruption if the same data file is added multiple times.

Review Comment:
   ```suggestion
   The `check_duplicate_files` parameter controls whether the method checks if 
any of the provided `file_paths` are already present in the Iceberg table. By 
default, it is set to `True`, which performs a validation against the table’s 
current data files to prevent accidental duplication.
   
   This check helps maintain data consistency by ensuring that the same data 
file is not added multiple times. However, for tables with a large number of 
files, this validation can be expensive in terms of performance.
   
   To skip the duplicate check, set `check_duplicate_files=False`. This can 
improve performance but increases the risk of introducing duplicate files, 
which may lead to data inconsistency or table corruption if the same file is 
added more than once.
   ```
   
   I used an LLM to generate this from the function definition. WDYT?
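
To make the idea behind the duplicate check concrete, here is a minimal conceptual sketch in plain Python. It is not the library's implementation: `find_duplicate_paths` and its arguments are hypothetical names, and the existing data file paths are assumed to be already available as a list of strings.

```python
def find_duplicate_paths(file_paths: list[str], existing_paths: list[str]) -> list[str]:
    """Return the entries of file_paths that already appear in existing_paths.

    Conceptual sketch only. Building the set costs time proportional to the
    number of existing files, which is why such a check can be expensive
    for tables with many data files.
    """
    existing = set(existing_paths)
    return [path for path in file_paths if path in existing]


# Hypothetical paths, used only to exercise the helper.
already_in_table = ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]
to_add = ["s3://bucket/b.parquet", "s3://bucket/c.parquet"]
print(find_duplicate_paths(to_add, already_in_table))
```

Skipping a check like this avoids the cost of listing the existing files, at the price of letting an already-committed path through.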



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

