abnobdoss opened a new pull request, #3420: URL: https://github.com/apache/iceberg-python/pull/3420
# Rationale for this change This is part 1 of 2 for fixing upsert segfault / recursion failures caused by large composite-key match filters. It should also improve planning performance because the current exact PyIceberg expression tree is expensive to construct, bind, and evaluate for large source batches. For composite join keys, the initial upsert scan currently builds one `And(EqualTo(...), EqualTo(...))` predicate per unique source key and reduces them into a left-leaning `Or` tree. Large source batches therefore create deep recursive expression trees before the target scan starts; expression visitors then walk `Or.left` / `Or.right` recursively, so stack usage scales with the number of unique key combinations. This change adds a bounded `create_file_match_filter` for the initial target file-pruning scan. It uses per-column min/max bounds, plus `IsNull` where needed, so the initial scan predicate size is tied to the number of join columns rather than the number of source rows. The exact `create_match_filter` is still used later for overwrite and insert filtering, so row-level correctness is unchanged. The tradeoff is that the bounded file-pruning predicate is conservative: for composite keys it can scan cross-product false positives, which downstream exact matching culls before update/insert decisions. This can technically increase the number of files scanned, but the current failure mode makes upsert impossible for affected large composite-key inputs. The intent of this PR is to remove that crash first; if there is appetite to improve pruning further, a follow-up can use a more selective bounding-box or more sophisticated algorithm while preserving the bounded expression shape. Match-filter timings below include expression construction, schema binding, PyArrow expression conversion, and applying the resulting Arrow filter to an in-memory table. These do not measure end-to-end upsert time, which depends on table layout, file pruning, IO, and downstream exact matching. Benchmark script: https://gist.github.com/abnobdoss/824698ea74c29c02ead8da6410b89b6e Current exact match-filter path: | Unique source keys | 1 join column | 2 join columns | 3 join columns | | --- | --- | --- | --- | | 100 | 1.5 ms | 7.3 ms | 11 ms | | 1k | 4.7 ms | 87 ms | 147 ms | | 10k | 33 ms | SIGSEGV | SIGSEGV | | 100k | 650 ms | SIGSEGV | SIGSEGV | | 1m | 6.59 s | >30 s | >30 s | Bounded file-match filter path after this PR: | Unique source keys | 1 join column | 2 join columns | 3 join columns | | --- | --- | --- | --- | | 100 | 1.0 ms | 1.1 ms | 1.2 ms | | 1k | 1.0 ms | 1.1 ms | 1.2 ms | | 10k | 1.1 ms | 1.1 ms | 1.3 ms | | 100k | 1.3 ms | 1.6 ms | 1.9 ms | | 1m | 3.7 ms | 6.8 ms | 9.0 ms | Follow-up work should address the remaining exact match-filter recursion paths used later during overwrite and insert filtering. Those predicates need exact semantics, so they should be fixed separately. Related upsert performance / recursion reports and prior work: #2159, #2675, #3129, #2943, #3387. ## Are these changes tested? Yes. Added regression coverage for the large composite-key initial scan path, plus focused coverage for bounded file-match filter construction and the conservative false-positive culling path. Ran: - `uv run python -m pytest tests/table/test_upsert.py -k "large_composite_key_initial_scan or file_match_filter or create_match_filter"` - `uv run python -m pytest tests/table/test_upsert.py` - commit hooks The added `test_upsert_large_composite_key_initial_scan_does_not_recurse` regression test runs the large composite-key upsert in a subprocess so runtimes that would segfault report a pytest failure instead of terminating the pytest worker; the equivalent base-branch path exits `139` under Python 3.10, while this branch exits successfully. ## Are there any user-facing changes? No API changes. Upserts with large composite-key source batches should avoid the initial scan-planning recursion failure and may scan a conservative superset of candidate files before exact row matching. --- AI assistance was used during development to help explore the failure mode, implement the change, and add tests. I reviewed the outputs throughout and validated the final changes locally. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
