rdblue commented on code in PR #8278:
URL: https://github.com/apache/iceberg/pull/8278#discussion_r1291843352


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -231,6 +231,10 @@ private TableProperties() {}
   public static final String ORC_BATCH_SIZE = "read.orc.vectorization.batch-size";
   public static final int ORC_BATCH_SIZE_DEFAULT = 5000;
 
+  public static final String DELETE_COLUMN_STATS_FILTERING_ENABLED =
+      "read.delete.column-stats-filtering.enabled";

Review Comment:
   I think I wasn't very clear about what I was trying to say. I'll reply to 
some of the specific points:
   
   > We want a way to disable column stats filtering for some equality delete 
use cases where filtering is not beneficial.
   
   I completely agree that we should disable the filtering. What I'm saying is that we should _completely_ disable it rather than provide an option to turn it on or off. We don't expect it to be helpful except in the case where the min and max are identical.
   
   I also think it would be good to detect the case where the path's `min == max` and put those position delete files in a map. That is worth doing because there are two cases for a position delete file:
   1. There are deletes for multiple data files, so stats aren't going to help
   2. There are deletes for just one data file, and the stats tell us which one
   
   Those are very different cases because of the number of delete files involved. If you have a strategy that rewrites position deletes into one file per data file, you'd end up with a _lot_ of very narrowly targeted delete files. A map keyed by the delete path makes a lot of sense for that case because it removes so many delete files from the general delete index entirely.
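   
   To make the split concrete, here's a minimal sketch of that partitioning, not the PR's code; `DeleteIndexSketch` and its fields are invented names, and it relies only on the `DeleteFile` bounds maps and `MetadataColumns.DELETE_FILE_PATH`:
   
```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.FileContent;
import org.apache.iceberg.MetadataColumns;

class DeleteIndexSketch {
  private static final int PATH_ID = MetadataColumns.DELETE_FILE_PATH.fieldId();

  // position deletes that target exactly one data file, keyed by that path
  final Map<String, List<DeleteFile>> deletesByPath = new HashMap<>();
  // deletes whose path bounds span multiple data files; stats won't help here
  final List<DeleteFile> globalDeletes = new ArrayList<>();

  void add(DeleteFile delete) {
    String target = singleTargetPath(delete);
    if (target != null) {
      deletesByPath.computeIfAbsent(target, k -> new ArrayList<>()).add(delete);
    } else {
      globalDeletes.add(delete);
    }
  }

  // returns the one data file path a position delete applies to, or null when
  // the file_path bounds are missing or cover more than one path; truncated
  // bounds can never be equal, so they safely fall into the global list
  private static String singleTargetPath(DeleteFile delete) {
    if (delete.content() != FileContent.POSITION_DELETES
        || delete.lowerBounds() == null
        || delete.upperBounds() == null) {
      return null;
    }
    ByteBuffer lower = delete.lowerBounds().get(PATH_ID);
    ByteBuffer upper = delete.upperBounds().get(PATH_ID);
    if (lower != null && lower.equals(upper)) {
      // min == max: every deleted position is in this single data file
      return StandardCharsets.UTF_8.decode(lower.duplicate()).toString();
    }
    return null;
  }
}
```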
   
   > We want to skip even loading delete stats if they will not be used
   
   I agree in principle, but I don't think this is a huge cost. The data has to 
be read into memory anyway, we avoid converting to `String`, and it all runs in 
parallel.
   
   The biggest cost by far is comparison: partly because comparing large strings is expensive, but mostly because of the sheer number of comparisons. We have to compare each data file against each delete file with a newer sequence number, which amounts to `O(m*n)` comparisons for `m` delete files and `n` data files. Reducing `m` by separating out the `min == max` delete files, and then not running the stats comparison at all for the remaining position deletes, seems like a really good start.
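   
   To make the arithmetic concrete (illustrative numbers only): with `m = 1,000` position delete files of which 900 each target a single data file, the per-data-file work drops from 1,000 bound comparisons to one hash probe plus 100 comparisons. A hedged sketch of the lookup side, continuing the hypothetical `DeleteIndexSketch` above (imports for `DataFile` and `Collections` elided):
   
```java
// Two more methods for the hypothetical DeleteIndexSketch above. Per data file
// we pay one O(1) map probe; only the much smaller global set still needs the
// per-file sequence number comparison.
List<DeleteFile> deletesFor(DataFile dataFile, long dataSequenceNumber) {
  List<DeleteFile> result = new ArrayList<>();
  for (DeleteFile delete :
      deletesByPath.getOrDefault(dataFile.path().toString(), Collections.emptyList())) {
    if (applies(delete, dataSequenceNumber)) {
      result.add(delete);
    }
  }
  for (DeleteFile delete : globalDeletes) {
    if (applies(delete, dataSequenceNumber)) {
      result.add(delete);
    }
  }
  return result;
}

// position deletes apply when their sequence number is >= the data file's
private static boolean applies(DeleteFile delete, long dataSequenceNumber) {
  Long deleteSeq = delete.dataSequenceNumber();
  return deleteSeq != null && deleteSeq >= dataSequenceNumber;
}
```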
   
   I could be convinced that we need to avoid even loading the maps, but I'd want to see benchmark numbers first. Plus, we still need to read the upper and lower bounds for equality deletes, so not loading stats complicates the projection.
   
   Another idea is to add the ability to load data for a subset of columns by 
ID.
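   
   Purely to illustrate that last idea, and not an existing API: one possible shape for loading bounds for only a subset of field IDs. `BoundsProjection` and `pathOnly` are invented names:
   
```java
import java.util.Collections;
import java.util.Set;
import org.apache.iceberg.MetadataColumns;

// Invented API shape, not something in Iceberg today: the caller names the
// field IDs it needs so the reader can skip decoding bounds for other columns.
interface BoundsProjection {
  Set<Integer> boundFieldIds();

  // for position deletes, only the file_path bounds would be interesting
  static BoundsProjection pathOnly() {
    return () -> Collections.singleton(MetadataColumns.DELETE_FILE_PATH.fieldId());
  }
}
```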
   
   > We may reconsider how our paths are being generated and sometimes include 
a temporal part to facilitate filtering
   
   Wouldn't we just be back to loading stats then? If I understand correctly, this would make the stats check more valuable, but right now that check is already more expensive than we want to run at all.
   
   > We want to get rid of any extra overhead (like Stream API) and transforms 
if stats filtering is disabled
   
   I agree. Don't we often do this by not adding filters or transforms if they 
aren't needed?
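   
   For what it's worth, a minimal sketch of that pattern (hypothetical flag and predicate names): the transform is only composed into the pipeline when it's needed, so the disabled path pays nothing:
   
```java
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;

class ConditionalStatsFiltering {
  private final boolean statsFilteringEnabled; // hypothetical flag

  ConditionalStatsFiltering(boolean statsFilteringEnabled) {
    this.statsFilteringEnabled = statsFilteringEnabled;
  }

  Iterable<DeleteFile> candidates(Iterable<DeleteFile> deletes) {
    if (!statsFilteringEnabled) {
      return deletes; // disabled path: no wrapper, no per-element work at all
    }
    // only build the lazy filtering wrapper when stats filtering is on
    return Iterables.filter(deletes, this::mayApplyByStats);
  }

  private boolean mayApplyByStats(DeleteFile delete) {
    return true; // placeholder for the real bounds comparison
  }
}
```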
   


