koenvo commented on issue #2159:
URL: 
https://github.com/apache/iceberg-python/issues/2159#issuecomment-3026872310

   This aligns well with the discussion here: 
https://github.com/apache/iceberg-python/issues/2138#issuecomment-2997190853
   
   While there have been improvements to `upsert` - like reducing memory 
pressure and avoiding recursion in `create_match_filter` - performance is still 
suboptimal. This is largely because `upsert` is currently built on top of 
operations like `delete` and `overwrite`.
   
   For example, `overwrite` internally performs a `delete` followed by an 
`append`. The `delete` step may even rewrite entire data files by applying a 
filter and preserving rows that don’t match, then rewriting the result using 
the current schema, partition spec, and sort order.
   
   Interestingly, this "replace" behavior in `delete` is already very close to 
what an `upsert` needs to do - especially when new rows are appended 
immediately after. Here's a simplified view of how files are rewritten:
   
   ```
   df = ArrowScan(...).to_table(...)
   filtered_df = df.filter(preserve_row_filter)
   
   if len(filtered_df) == 0:
       replaced_files.append((original_file.file, []))
   elif len(df) != len(filtered_df):
       replaced_files.append((original_file.file, [...]))
   ```
   
   Because `upsert` reuses these higher-level constructs, we lose the 
opportunity to optimize the operation at a lower level. Treating `upsert` as a 
first-class primitive, like `delete` or `overwrite`, would allow us to optimize 
each step more precisely and avoid unnecessary rewrites, which is especially 
important for large tables.
   
   cc @Fokko 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to