aokolnychyi commented on a change in pull request #2591:
URL: https://github.com/apache/iceberg/pull/2591#discussion_r649356000
##########
File path: core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java
##########
@@ -162,4 +163,27 @@ private void validateOptions() {
"Cannot set %s is less than 1. All values less than 1 have the same effect as 1. %d < 1",
MIN_INPUT_FILES, minInputFiles);
}
+
+ protected long targetFileSize() {
+ return this.targetFileSize;
+ }
+
+ /**
+ * Ideally every Spark Task that is generated will be less than or equal to our target size but
+ * in practice this is not the case. When we actually write our files, they may exceed the target
+ * size and end up being split. This would end up producing 2 files out of one task, one target sized
+ * and one very small file. Since the output file can vary in size, it is better to
+ * use a slightly larger (but still within threshold) size for actually writing the tasks out.
+ * This helps us in the case where our estimate for the task size is under the target size but the
+ * actual written file size is slightly larger.
+ * @return the target size plus one half of the distance between max and target
+ */
+ protected long writeMaxFileSize() {
+ return (long) (this.targetFileSize + ((this.maxFileSize - this.targetFileSize) * 0.5));
+
Review comment:
nit: extra line
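
For reference, a minimal standalone sketch of the arithmetic described in the Javadoc of `writeMaxFileSize()` in the hunk above. The class name `WriteMaxFileSizeSketch`, the static two-argument method, and the 512 MiB / 768 MiB sizes are illustrative assumptions, not part of the PR; only the formula itself is taken from the diff.

```java
// Hypothetical standalone sketch; not the actual BinPackStrategy class.
class WriteMaxFileSizeSketch {

  // Same formula as in the diff: target size plus half of the distance
  // between the max file size and the target file size.
  static long writeMaxFileSize(long targetFileSize, long maxFileSize) {
    return (long) (targetFileSize + ((maxFileSize - targetFileSize) * 0.5));
  }

  public static void main(String[] args) {
    long target = 512L * 1024 * 1024; // assumed target: 512 MiB
    long max = 768L * 1024 * 1024;    // assumed max: 768 MiB

    // Halfway between 512 MiB and 768 MiB is 640 MiB.
    System.out.println(writeMaxFileSize(target, max)); // prints 671088640
  }
}
```

With these assumed values, a task whose written output lands slightly above 512 MiB still fits in a single file as long as it stays at or below 640 MiB, which is the case the Javadoc describes.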