steven-winfield-quantohm opened a new pull request, #3509:
URL: https://github.com/apache/iceberg-python/pull/3509

   <!--
   Thanks for opening a pull request!
   -->
   
   <!-- In the case this PR will resolve an issue, please replace 
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
   Closes #3508
   
   # Rationale for this change
   
   The problem
   -----------
   When upserting into an Iceberg table, PyIceberg first scans the target table 
to
   find which existing rows match the source rows' key columns.  It builds that
   "matching" predicate in ``pyiceberg.table.upsert_util.create_match_filter``:
   
   * For a **single** join column it emits one flat ``In(col, [v1, v2, ...])``.
     PyArrow lowers this to a single ``is_in`` compute node, no matter how many
     values it contains — so single-column upserts of huge tables are fine.
   
   * For a **multi-column** key it instead emits one disjunct per distinct key
     tuple::
   
         Or(And(c1 == v1, c2 == w1),
            And(c1 == v2, c2 == w2),
            ...)                          # ONE disjunct PER ROW
   
   PyIceberg builds that ``Or`` as a balanced tree, so the *Python* side copes.
   But when the expression is handed to PyArrow's dataset scanner as a filter, 
the
   C++ expression engine canonicalises it: ``Dataset::GetFragments`` calls
   ``SimplifyWithGuarantee`` → ``Canonicalize``, which flattens the associative
   ``or_kleene`` chain and then **recurses** over it.  With tens of thousands of
   disjuncts that recursion overflows the C++ call stack and the **process
   segfaults** (SIGSEGV) — typically after several minutes of work, with a
   backtrace full of ``arrow::compute::Canonicalize`` / ``ModifyExpression``
   frames.
   
   Reference: https://github.com/apache/iceberg-python/issues/3272
   
   Note that apache/iceberg-python#3448 addresses a *different* upsert segfault 
(a
   per-batch Acero re-filter in ``_task_to_record_batches``, mostly observed on
   Apple Silicon).  It does not touch the ``GetFragments`` canonicalisation path
   exercised here, so it does not help with this crash.
   
   The fix
   -------
   Produce a predicate that matches exactly the same rows, but with far fewer
   disjuncts.  Group the key tuples and emit a single ``In`` over whichever 
column
   collapses to the fewest distinct "prefix" combinations (choosing that column
   makes the result independent of the caller's column ordering)::
   
       Or(And(c1 == v1, c2 IN [w, x, y]),
          And(c1 == v2, c2 IN [z]),
          ...)                            # one disjunct per distinct PREFIX
   
   The disjunct count drops from "number of rows" to "number of distinct prefix
   values".  In the synthetic data below there are 50 000 unique ids spread over
   just 50 group values, so the predicate shrinks from 50 000 disjuncts to 50 —
   shallow enough that PyArrow's canonicaliser no longer overflows.
   
   Caveat
   ------
   This helps whenever at least one key column is low-cardinality (or, 
equivalently,
   one column is near-unique and can be folded into the ``In``).  A genuinely
   high-cardinality *composite* key — where every column is near-unique and all 
of
   them are needed to identify a row — still produces roughly one disjunct per 
row
   even after grouping, and can still overflow.  For that pathological case the
   only robust option is to upsert in smaller batches.
   
   ## Are these changes tested?
   
   Yes - one test changed to expect `And(op1, op2)` or `And(op2, op1)` where 
previously
   the operand order mattered.
   
   ## Are there any user-facing changes?
   
   No
   
   <!-- In the case of user-facing changes, please add the changelog label. -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to