[GitHub] [iceberg] rajarshisarkar commented on a change in pull request #4377: Spark: Add option to introduce ordering of RewriteFileGroup

GitBox Tue, 29 Mar 2022 04:11:30 -0700


rajarshisarkar commented on a change in pull request #4377:
URL: https://github.com/apache/iceberg/pull/4377#discussion_r837311866




##########
File path: api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java
##########
@@ -88,6 +90,19 @@
   String USE_STARTING_SEQUENCE_NUMBER = "use-starting-sequence-number";
   boolean USE_STARTING_SEQUENCE_NUMBER_DEFAULT = true;
 
+  /**
+   * Forces the compaction order based on the value.
+   * <p>
+   * If rewrite.job-order=bytes, then compact based on ascending order of partition size in bytes.

Review comment:
       I think these could be the use cases (I can be convinced otherwise):
   - bytes-ascending: compact the smaller groups first and leave the larger 
ones for last, as they are more likely to be stragglers.
   - bytes-descending: compact the groups with the most bytes first to gain 
performance. The customer can enable `partial-progress.enabled` and set a low 
`partial-progress.max-commits` to compact the bulkiest groups greedily. This 
should be done in a maintenance window where stragglers are not a problem.
   - files-descending: the intention would be to reduce the number of files 
to gain performance.
   
   I am not sure about the files-ascending use case. I feel we might still be 
missing many other customer use cases. 
   
   Thought: Iceberg can ship some pre-defined rewrite job orders (the initial 
PR scope). Could we also introduce a custom order, where the user passes a 
comparator at runtime and we load it via reflection?
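
   To make the idea concrete, here is a minimal sketch of how the pre-defined 
orders and a reflection-loaded custom order could sit side by side. Everything 
in it is an assumption for illustration: `GroupInfo`, `FilesThenBytes`, 
`resolve`, `loadCustom`, and `plan` are hypothetical names, not classes from 
this PR or from Iceberg, and the order strings simply mirror the names used 
above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RewriteJobOrderSketch {

  // Hypothetical summary of one file group; NOT Iceberg's RewriteFileGroup.
  public static final class GroupInfo {
    public final String partition;
    public final long sizeBytes;
    public final int numFiles;

    public GroupInfo(String partition, long sizeBytes, int numFiles) {
      this.partition = partition;
      this.sizeBytes = sizeBytes;
      this.numFiles = numFiles;
    }
  }

  // Example user-supplied order (hypothetical): most files first, ties broken by most bytes.
  public static final class FilesThenBytes implements Comparator<GroupInfo> {
    @Override
    public int compare(GroupInfo a, GroupInfo b) {
      int byFiles = Integer.compare(b.numFiles, a.numFiles);
      return byFiles != 0 ? byFiles : Long.compare(b.sizeBytes, a.sizeBytes);
    }
  }

  // Maps an order name to a comparator; any unknown name is treated as a class name.
  public static Comparator<GroupInfo> resolve(String order) {
    switch (order) {
      case "bytes-ascending":
        return Comparator.comparingLong((GroupInfo g) -> g.sizeBytes);
      case "bytes-descending":
        return Comparator.comparingLong((GroupInfo g) -> g.sizeBytes).reversed();
      case "files-descending":
        return Comparator.comparingInt((GroupInfo g) -> g.numFiles).reversed();
      default:
        return loadCustom(order);
    }
  }

  // Loads a comparator by class name; the class needs an accessible no-arg constructor.
  @SuppressWarnings("unchecked")
  static Comparator<GroupInfo> loadCustom(String className) {
    try {
      Class<?> clazz = Class.forName(className);
      if (!Comparator.class.isAssignableFrom(clazz)) {
        throw new IllegalArgumentException(className + " is not a Comparator");
      }
      return (Comparator<GroupInfo>) clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load job order: " + className, e);
    }
  }

  // Returns partition names in the order the groups would be compacted.
  public static List<String> plan(List<GroupInfo> groups, String order) {
    List<GroupInfo> sorted = new ArrayList<>(groups);
    sorted.sort(resolve(order));
    List<String> names = new ArrayList<>();
    for (GroupInfo g : sorted) {
      names.add(g.partition);
    }
    return names;
  }

  public static void main(String[] args) {
    List<GroupInfo> groups = Arrays.asList(
        new GroupInfo("a", 100L, 5),
        new GroupInfo("b", 300L, 2),
        new GroupInfo("c", 200L, 5));
    System.out.println(plan(groups, "bytes-descending"));
    System.out.println(plan(groups, FilesThenBytes.class.getName()));
  }
}
```

   The trade-off in this sketch is that a misspelled pre-defined name falls 
through to the reflection path and fails with a class-loading error, so a real 
implementation would probably want a separate option for the custom class name.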




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


