chenjunjiedada commented on a change in pull request #3454: URL: https://github.com/apache/iceberg/pull/3454#discussion_r741606736
########## File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java ########## @@ -75,7 +75,20 @@ public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes"; public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d; + /** + * The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. + * If a data file has more than this number of deletes, it will be rewritten regardless of its file size determined + * by {@link #MIN_FILE_SIZE_BYTES} and {@link #MAX_FILE_SIZE_BYTES}. + * If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of + * the number of files in the file group determined by {@link #MIN_INPUT_FILES} + * <p> + * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default. + */ + public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file"; Review comment: What value is recommended to the user? Or how does the user compute the proper value? I'm thinking maybe adding an option to count filegroup valid if any file of it contains deletes. Because you don't know how many records match the equality delete, for example, delete a set of records in an area/province. Nit: This is disabled by default but it is always comparing. ########## File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java ########## @@ -75,7 +75,20 @@ public static final String MAX_FILE_SIZE_BYTES = "max-file-size-bytes"; public static final double MAX_FILE_SIZE_DEFAULT_RATIO = 1.80d; + /** + * The minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting. + * If a data file has more than this number of deletes, it will be rewritten regardless of its file size determined + * by {@link #MIN_FILE_SIZE_BYTES} and {@link #MAX_FILE_SIZE_BYTES}. + * If a file group contains a file that satisfies this condition, the file group will be rewritten regardless of + * the number of files in the file group determined by {@link #MIN_INPUT_FILES} + * <p> + * Defaults to Integer.MAX_VALUE, which means this feature is not enabled by default. + */ + public static final String MIN_DELETES_PER_FILE = "min-deletes-per-file"; Review comment: Also, do we need to have specific settings for equality delete and position delete separately? Since the read penalties are different. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org