jackye1995 commented on a change in pull request #3292:
URL: https://github.com/apache/iceberg/pull/3292#discussion_r743363296
##########
File path:
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkFilesScan.java
##########
@@ -37,7 +37,8 @@
class SparkFilesScan extends SparkBatchScan {
private final String taskSetID;
- private final long splitSize;
+ private final Long readSplitSize;
Review comment:
+1. Also, why rename this to `readSplitSize` when you already give the
other config a different name?
##########
File path:
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/actions/Spark3BinPackStrategy.java
##########
@@ -61,9 +62,17 @@ public Table table() {
SparkSession cloneSession = spark.cloneSession();
cloneSession.conf().set(SQLConf.ADAPTIVE_EXECUTION_ENABLED().key(), false);
+ long targetReadSize = splitSize(inputFileSize(filesToRewrite));
+ // Ideally this would be the row-group size but the row group size is not guaranteed to be consistent
+ long fileSplitSize = Long.valueOf(table.properties().getOrDefault(
+ TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES,
Review comment:
There seems to be no related config for Avro and ORC. I think the best
we can do here is to check the table's file format: if it is Parquet, we try
to read this config; otherwise we use the default directly.
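A minimal sketch of that fallback, written against a plain property map so it
compiles without Iceberg on the classpath. The class and method names are
hypothetical and not code from the PR. The two property keys are assumed to
match the values behind `TableProperties.DEFAULT_FILE_FORMAT` and
`TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES`, and treating an unset format
as Parquet is also an assumption:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the PR: picks a file split size from table properties.
final class SplitSizeSketch {
  // Assumed property keys, written as literals so the sketch is self-contained.
  static final String DEFAULT_FILE_FORMAT = "write.format.default";
  static final String PARQUET_ROW_GROUP_SIZE_BYTES = "write.parquet.row-group-size-bytes";

  private SplitSizeSketch() {
  }

  // Returns the Parquet row group size if the table format is Parquet and the
  // property is set; otherwise returns the caller-supplied default.
  static long fileSplitSize(Map<String, String> props, long defaultSplitSize) {
    // Assumption: a table with no explicit format is treated as Parquet.
    String format = props.getOrDefault(DEFAULT_FILE_FORMAT, "parquet");
    if (!"parquet".equals(format.toLowerCase(Locale.ROOT))) {
      // Avro and ORC have no comparable config, so use the default directly.
      return defaultSplitSize;
    }
    String rowGroupSize = props.get(PARQUET_ROW_GROUP_SIZE_BYTES);
    return rowGroupSize == null ? defaultSplitSize : Long.parseLong(rowGroupSize);
  }
}
```

With this shape, an ORC or Avro table never picks up a stale Parquet setting,
and the caller decides what "the default" is.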
##########
File path:
spark/v3.0/spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
##########
@@ -57,6 +57,9 @@ private SparkReadOptions() {
// Set ID that is used to fetch file scan tasks
public static final String FILE_SCAN_TASK_SET_ID = "file-scan-task-set-id";
+ // Set the target task size of a file scan combined tasks
+ public static final String FILE_SCAN_TARGET_SIZE = "file-scan-target-size";
Review comment:
How about `file-scan-plan-tasks-split-size`? Maybe too long...