tanmayrauth commented on issue #3498:
URL: 
https://github.com/apache/iceberg-python/issues/3498#issuecomment-4701086673

   I went through visitors.py and the call sites, and three of the four hold 
up. The big one is #1: visit_not_equal and visit_not_in short-circuit on  
_can_contain_nulls/_can_contain_nans and return ROWS_MUST_MATCH without ever 
checking the bounds, where every other strict method (and Java) uses the 
_contains_*_only variants. Since the strict evaluator drives _DeleteFiles 
(snapshot.py:460), this isn't just over-pruning — a delete(NotEqualTo("x", 5)) 
against a file with stats [null, 5] would drop the whole file and take the 
legitimate 5 row with it, so it's effectively silent data loss. #3 and #4 are 
real too: the ResidualVisitor comparisons hit None < literal and raise 
TypeError where the row evaluator safely guards for None, and 
visit_not_nan(None) returns AlwaysFalse while the row evaluator treats it as 
true — both reachable on the scan path via residual_for. The only one I'd push 
back on is #2; the record_count <= 0 behavior matches upstream Java and looks  
intentional, though
  the comment could be clearer about the -1 case. 
   
   If nobody's already on this, I'd like to take it, starting with #1 since 
it's the data-loss path, with a regression test asserting the 5 row survives 
the delete, then a follow-up for the ResidualVisitor null handling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to