rdblue commented on code in PR #8278:
URL: https://github.com/apache/iceberg/pull/8278#discussion_r1291843352
##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -231,6 +231,10 @@ private TableProperties() {}
   public static final String ORC_BATCH_SIZE =
       "read.orc.vectorization.batch-size";
   public static final int ORC_BATCH_SIZE_DEFAULT = 5000;
+  public static final String DELETE_COLUMN_STATS_FILTERING_ENABLED =
+      "read.delete.column-stats-filtering.enabled";
Review Comment:
I think I wasn't very clear about what I was trying to say. I'll reply to
some of the specific points:
> We want a way to disable column stats filtering for some equality delete use cases where filtering is not beneficial.

I completely agree that we should disable the filtering. What I'm saying is that we should _completely_ disable it rather than provide an option to turn it on or off. We don't expect it to be helpful, except in the case where the min and max are identical.
I also think it would be good to detect the case where the path's `min == max` and put those positional delete files in a map. I think that is worth it because we have two cases for a positional delete file:
1. There are deletes for multiple data files, so stats aren't going to help
2. There are deletes for just one data file, and stats tell us that
Those are very different cases because of the number of delete files involved. If you have a strategy that rewrites position deletes into one file per data file, you'd end up with a _lot_ of very narrowly targeted delete files. Using a map keyed by the referenced data file path makes a lot of sense for that case because it removes those delete files from the general delete index entirely.
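To make the split concrete, here's a rough sketch (not the actual `DeleteFileIndex` code). It assumes the `lowerBounds()`/`upperBounds()` maps on `DeleteFile` and uses `MetadataColumns.DELETE_FILE_PATH` for the path column id; the class and method names are just for illustration:
```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.MetadataColumns;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

class TargetedDeleteSplit {
  // Sketch: split position deletes into (a) files whose file_path min == max,
  // keyed by the single data file path they target, and (b) general deletes
  // where stats won't narrow anything.
  static void split(
      Iterable<DeleteFile> posDeletes,
      Map<String, List<DeleteFile>> deletesByPath,
      List<DeleteFile> globalDeletes) {
    int pathId = MetadataColumns.DELETE_FILE_PATH.fieldId();
    for (DeleteFile delete : posDeletes) {
      Map<Integer, ByteBuffer> lowers = delete.lowerBounds();
      Map<Integer, ByteBuffer> uppers = delete.upperBounds();
      ByteBuffer lower = lowers == null ? null : lowers.get(pathId);
      ByteBuffer upper = uppers == null ? null : uppers.get(pathId);
      if (lower != null && lower.equals(upper)) {
        // min == max: every position in this delete file targets one data file
        CharSequence path = Conversions.fromByteBuffer(Types.StringType.get(), lower);
        deletesByPath.computeIfAbsent(path.toString(), k -> new ArrayList<>()).add(delete);
      } else {
        // bounds missing or spanning multiple data files: fall back to the index
        globalDeletes.add(delete);
      }
    }
  }
}
```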
> We want to skip even loading delete stats if they will not be used

I agree in principle, but I don't think this is a huge cost. The bytes have to be read into memory anyway, the only work we'd skip is converting them to `String`, and it all runs in parallel.
The biggest cost by far is comparison: partly because comparing long path strings is expensive, but mostly because of the sheer number of comparisons. We have to compare each data file against each delete file with a newer sequence number, which is `O(m*n)` comparisons for `m` delete files and `n` data files. Reducing `m` by separating out the `min == max` delete files, and then not running the stats comparison at all for the remaining position deletes, seems like a really good start.
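For a rough sense of scale (hypothetical numbers, not a benchmark): with `m = 1,000` delete files and `n = 100,000` data files, that product is up to `10^8` path comparisons. Every `min == max` delete file that moves into the path map removes a full row of `n` comparisons from it.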
I could be convinced that we need to avoid even loading the maps, but I'd like to see benchmark numbers first. Plus, we still need to read the upper and lower bounds for equality deletes, and not loading stats complicates the projection.
Another idea is to add the ability to load stats for a subset of columns by ID.
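Purely to illustrate that idea (this hook does not exist today; the interface shape is hypothetical):
```java
import java.util.Set;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.io.CloseableIterable;

// Hypothetical reader hook: materialize lower/upper bounds only for the
// requested column ids (e.g. just the file_path column for position deletes,
// or the equality field ids for equality deletes), dropping the rest at read
// time instead of building the full stats maps.
interface ProjectedStatsReader {
  CloseableIterable<DeleteFile> readDeletes(ManifestFile manifest, Set<Integer> statsColumnIds);
}
```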
> We may reconsider how our paths are being generated and sometimes include a temporal part to facilitate filtering

Wouldn't we just be back to loading stats then? If I understand correctly, that would make the stats check more valuable, but right now the check is already more expensive than we want to run at all.
> We want to get rid of any extra overhead (like Stream API) and transforms if stats filtering is disabled

I agree. Don't we often do this by not adding filters or transforms if they aren't needed?
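Something along these lines, as a sketch (the `readDeletes` helper and the `statsFilter` predicate are stand-ins, not existing code):
```java
import java.util.function.Predicate;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.io.CloseableIterable;

class SkipUnusedTransforms {
  // statsFilter is null when stats filtering is disabled
  static CloseableIterable<DeleteFile> load(
      ManifestFile manifest, Predicate<DeleteFile> statsFilter) {
    CloseableIterable<DeleteFile> deletes = readDeletes(manifest);
    if (statsFilter == null) {
      // filtering disabled: return the raw iterable, no wrapper at all
      return deletes;
    }
    return CloseableIterable.filter(deletes, statsFilter);
  }

  private static CloseableIterable<DeleteFile> readDeletes(ManifestFile manifest) {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```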