wypoon commented on PR #5742:
URL: https://github.com/apache/iceberg/pull/5742#issuecomment-1260027746

   The code paths that goes through the streaming filters -- 
`Deletes.streamingMarker` and `Deletes.streamingFilter` --
   ```
       return hasIsDeletedColumn
           ? Deletes.streamingMarker(
               records, this::pos, Deletes.deletePositions(filePath, deletes), 
this::markRowDeleted)
           : Deletes.streamingFilter(
               records, this::pos, Deletes.deletePositions(filePath, deletes), 
counter);
   ```
   are called in `DeleteFilter#applyPosDeletes`, which is called by 
`DeleteFilter#filter`.
   In the two streaming filters, there is a counter with state.
   `this::markRowDeleted` ends up being 
`BaseReader.SparkDeleteFilter#markRowDeleted`, which uses the counter in the 
reader. In the other one, the counter is again the counter from the reader 
which is passed into the constructor of the `DeleteFilter`.
   In addition, `BaseReader.SparkDeleteFilter#markRowDeleted` has to account 
for additional state:
   ```
         if (!row.getBoolean(columnIsDeletedPosition())) {
           row.setBoolean(columnIsDeletedPosition(), true);
           counter().increment();
         }
   ```
   In order to avoid double-counting, it checks if the row has already been 
marked deleted. This happens to come into play in `Deletes.markDeleted` which 
is called by (among others) `DeleteFilter#applyEqDeletes(CloseableIterable<T>)` 
via `DeleteFilter#createDeleteIterable`. In this case, the applyEqDeletes is 
applied after the applyPosDeletes, but the point is that there is a lot of 
interaction of different parts of **stateful** code. Unless we are very clear 
that we know exactly what to test for, it is not obvious that just testing a 
call to `Deletes.streamingMarker` or a call to `Deletes.streamingFilter` has 
exercised the logic correctly.
   That is why I think testing at the level of `TestSparkReaderDeletes` makes 
it obvious what we've exercised.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to