koenvo commented on issue #2159: URL: https://github.com/apache/iceberg-python/issues/2159#issuecomment-3026872310
This aligns well with the discussion here: https://github.com/apache/iceberg-python/issues/2138#issuecomment-2997190853

While there have been improvements to `upsert` - such as reducing memory pressure and avoiding recursion in `create_match_filter` - performance is still suboptimal. This is largely because `upsert` is currently built on top of higher-level operations like `delete` and `overwrite`. For example, `overwrite` internally performs a `delete` followed by an `append`. The `delete` step may even rewrite entire data files: it applies a filter, keeps only the rows that don’t match, and writes the result back out using the current schema, partition spec, and sort order.

Interestingly, this "replace" behavior in `delete` is already very close to what an `upsert` needs to do - especially when new rows are appended immediately after. Here's a simplified view of how files are rewritten:

```python
# Read the affected data file back into memory and keep only the rows
# that should survive the delete.
df = ArrowScan(...).to_table(...)
filtered_df = df.filter(preserve_row_filter)

if len(filtered_df) == 0:
    # Every row matched the predicate: the whole file can be dropped.
    replaced_files.append((original_file.file, []))
elif len(df) != len(filtered_df):
    # Some rows matched: rewrite the file with only the preserved rows.
    replaced_files.append((original_file.file, [...]))
```

Because `upsert` reuses these higher-level constructs, we lose the opportunity to optimize the operation at a lower level. Treating `upsert` as a first-class primitive, like `delete` or `overwrite`, would allow us to optimize each step more precisely and avoid unnecessary rewrites, which is especially important for large tables. Two sketches follow below: one of the current layering, and one of what a fused rewrite could look like.

cc @Fokko
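For illustration, here is a minimal sketch of the current layering, expressed with pyiceberg's public API. The catalog name, table name, and key column are assumptions, and a real `upsert` also splits the incoming batch into updates and inserts, which is elided here:

```python
from functools import reduce

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo, Or

catalog = load_catalog("default")        # hypothetical catalog
table = catalog.load_table("db.events")  # hypothetical table

incoming = pa.table({"id": [1, 2], "payload": ["a", "b"]})

# Roughly what today's upsert boils down to: build a match filter over the
# key column(s), delete the matched rows (which may rewrite whole data
# files), then append the replacement rows in a second step.
match_filter = reduce(Or, [EqualTo("id", v) for v in incoming.column("id").to_pylist()])
table.delete(delete_filter=match_filter)  # may rewrite entire files
table.append(incoming)                    # rewritten rows land in new files again
```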
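And to make the proposal concrete: a first-class `upsert` could fuse the preserve-and-rewrite step with the insertion of the replacement rows, so each affected data file is read and written once instead of being rewritten by `delete` and then touched again by `append`. A minimal sketch in PyArrow, assuming matching schemas and a single key column (the function is hypothetical, not an existing API):

```python
import pyarrow as pa
import pyarrow.compute as pc


def rewrite_for_upsert(existing: pa.Table, updates: pa.Table, key: str) -> pa.Table:
    # Hypothetical fused rewrite: drop the matched rows and splice in their
    # replacements in one pass, yielding the content of a single new data file.
    matched = pc.is_in(existing.column(key), value_set=pc.unique(updates.column(key)))
    preserved = existing.filter(pc.invert(matched))
    # One concatenation, one file write - no separate delete + append cycle.
    return pa.concat_tables([preserved, updates])
```

The row-level work matches what the `delete` rewrite already performs today; the difference is that the preserved rows and their replacements end up in one file written in a single pass.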
