RussellSpitzer commented on a change in pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#discussion_r729979609
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkFilesScan.java
##########
@@ -37,9 +37,10 @@
class SparkFilesScan extends SparkBatchScan {
private final String taskSetID;
- private final long splitSize;
- private final int splitLookback;
- private final long splitOpenFileCost;
+ private final Long readSplitSize;
+ private final Long planTargetSize;
+ private final Integer splitLookback;
Review comment:
Ah, no reason; I can change them back.
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java
##########
@@ -61,9 +61,15 @@ public Table table() {
SparkSession cloneSession = spark.cloneSession();
cloneSession.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false);
+ long targetReadSize = splitSize(inputFileSize(filesToRewrite));
+ // Ideally this would be the row-group size but the row group size is
not guaranteed to be consistent
Review comment:
Sure, my main goal here was to make sure we get pieces from a file that is greater than 50% of the target file size. Whether or not the current row-group size is compatible with that, just dividing by four gives us a chance to get pieces that fit.
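The "divide by four" reasoning above can be sketched as follows. This is a hypothetical illustration, not the actual Iceberg code; the class and method names are assumptions.

```java
// Hypothetical sketch of the read-size heuristic discussed above:
// read files in pieces a quarter of the target file size, so that a
// leftover piece larger than 50% of the target can still be combined
// with others into a full-size output file.
public class ReadSizeHeuristic {
    // Names and signature are assumptions for illustration only.
    static long targetReadSize(long targetFileSize) {
        return targetFileSize / 4;
    }

    public static void main(String[] args) {
        long target = 512L * 1024 * 1024; // e.g. a 512 MB target file size
        System.out.println(targetReadSize(target)); // prints 134217728 (128 MB)
    }
}
```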
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/source/SparkFilesScan.java
##########
@@ -37,9 +37,10 @@
class SparkFilesScan extends SparkBatchScan {
private final String taskSetID;
- private final long splitSize;
- private final int splitLookback;
- private final long splitOpenFileCost;
+ private final Long readSplitSize;
+ private final Long planTargetSize;
+ private final Integer splitLookback;
Review comment:
Ah, I see. I was just matching all the other variables in this section, which are boxed objects; I'll change them back, though.
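Reverting from boxed `Long`/`Integer` back to primitives is a reasonable call. A generic illustration (not the Iceberg code) of the usual boxing pitfalls:

```java
// Generic example of why primitive fields are often safer than boxed
// ones: boxed values can be null (and throw NullPointerException when
// unboxed), and == compares object identity rather than value outside
// the small-integer cache.
public class BoxingPitfalls {
    public static void main(String[] args) {
        Long a = 1000L;
        Long b = 1000L;
        System.out.println(a == b);      // identity, not value: false on OpenJDK
        System.out.println(a.equals(b)); // value comparison: true

        Long maybeNull = null;
        // long c = maybeNull;           // would throw NullPointerException on unboxing
        long c = 7L;                     // a primitive can never be null
        System.out.println(c == 7L);     // plain value comparison: true
    }
}
```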
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java
##########
@@ -61,9 +61,15 @@ public Table table() {
SparkSession cloneSession = spark.cloneSession();
cloneSession.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false);
+ long targetReadSize = splitSize(inputFileSize(filesToRewrite));
+ // Ideally this would be the row-group size but the row group size is
not guaranteed to be consistent
Review comment:
Since we only have the row-group size set for Parquet, I'm going to check whether that value is set, and otherwise use the plan size divided by 4.
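That fallback could look roughly like this. The property key is the Iceberg table property `write.parquet.row-group-size-bytes`, but the class and helper names here are hypothetical, not the actual implementation:

```java
import java.util.Map;

// Hypothetical sketch of the fallback described above: use the Parquet
// row-group size when the table sets it, otherwise a quarter of the
// planned split size. Only the property key is taken from Iceberg;
// everything else is illustrative.
public class ReadSizeFallback {
    static final String PARQUET_ROW_GROUP_SIZE = "write.parquet.row-group-size-bytes";

    static long targetReadSize(Map<String, String> tableProps, long planSize) {
        String rowGroupSize = tableProps.get(PARQUET_ROW_GROUP_SIZE);
        if (rowGroupSize != null) {
            return Long.parseLong(rowGroupSize); // explicit Parquet row-group size
        }
        return planSize / 4; // no row-group hint: plan size over 4
    }

    public static void main(String[] args) {
        System.out.println(targetReadSize(Map.of(PARQUET_ROW_GROUP_SIZE, "134217728"), 1024L)); // prints 134217728
        System.out.println(targetReadSize(Map.of(), 1024L)); // prints 256
    }
}
```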
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]