shardulm94 commented on a change in pull request #2248:
URL: https://github.com/apache/iceberg/pull/2248#discussion_r580536744
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,31 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
return false;
}
- public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+ public static boolean isVectorizationEnabled(FileFormat fileFormat,
+                                              Map<String, String> properties,
+                                              CaseInsensitiveStringMap readOptions) {
String batchReadsSessionConf = SparkSession.active().conf()
.get("spark.sql.iceberg.vectorization.enabled", null);
if (batchReadsSessionConf != null) {
return Boolean.valueOf(batchReadsSessionConf);
}
- return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED,
Review comment:
I see tradeoffs either way. I agree that the most specific value would ideally be the read options explicitly passed to the table read. But a session conf taking higher precedence is also convenient in production: vectorization can be turned off for an application with a pure config change, no code changes needed.
Another option we have is to use a boolean `AND` between the session conf and the read option. This is used in
https://github.com/apache/iceberg/blob/91ac42174e4c535ece4e36db2cb587a23babced9/spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L182
It can be a little confusing here that the default of the session conf (true) is different from the default of the read option (false), but it is worth considering.
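For illustration, a minimal sketch of that `AND`-style combination, assuming the existing `SparkReadOptions.VECTORIZATION_ENABLED` option and the true/false defaults mentioned above (the simplified signature and the defaults are only illustrative, not a proposed final API):
```
import org.apache.iceberg.spark.SparkReadOptions;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class VectorizationConfSketch {
  // Sketch only: combine the session conf and the read option with a boolean AND.
  // The session conf acts as an application-wide kill switch (assumed default: true),
  // while the read option opts a specific read in (assumed default: false).
  public static boolean isVectorizationEnabled(CaseInsensitiveStringMap readOptions) {
    boolean sessionEnabled = Boolean.parseBoolean(
        SparkSession.active().conf().get("spark.sql.iceberg.vectorization.enabled", "true"));
    boolean readOptionEnabled = readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, false);
    return sessionEnabled && readOptionEnabled;
  }
}
```
With this shape, the session conf can still disable vectorization globally, but it can no longer force it on for reads that did not ask for it.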
##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -338,12 +330,50 @@ public boolean enableBatchRead() {
boolean hasNoDeleteFiles = tasks().stream().noneMatch(TableScanUtil::hasDeletes);
+ boolean batchReadsEnabled = batchReadsEnabled(allParquetFileScanTasks, allOrcFileScanTasks);
+
this.readUsingBatch = batchReadsEnabled && hasNoDeleteFiles && (allOrcFileScanTasks ||
    (allParquetFileScanTasks && atLeastOneColumn && onlyPrimitives));
}
return readUsingBatch;
}
+ private boolean batchReadsEnabled(boolean isParquetOnly, boolean isOrcOnly) {
+ if (isParquetOnly) {
+ return isVectorizationEnabled(FileFormat.PARQUET);
+ } else if (isOrcOnly) {
+ return isVectorizationEnabled(FileFormat.ORC);
+ } else {
+ return false;
+ }
+ }
+
+ public boolean isVectorizationEnabled(FileFormat fileFormat) {
Review comment:
Agreed, it may also be good to factor out the Iceberg session confs into a class of their own along with the defaults. We have three right now:
```
spark.sql.iceberg.vectorization.enabled
spark.sql.iceberg.check-ordering
spark.sql.iceberg.check-nullability
```
It is also probably worth adding these configs to the documentation.
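A rough sketch of what such a holder class could look like (the class name, location, and helper below are hypothetical, just to illustrate keeping the conf names and defaults in one place):
```
import org.apache.spark.sql.SparkSession;

// Hypothetical holder for the Iceberg Spark session confs listed above.
public class SparkSQLConfs {

  private SparkSQLConfs() {
  }

  public static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";
  public static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering";
  public static final String CHECK_NULLABILITY = "spark.sql.iceberg.check-nullability";

  // Read a boolean session conf, falling back to the caller-supplied default when unset.
  public static boolean getBoolean(String confName, boolean defaultValue) {
    String value = SparkSession.active().conf().get(confName, null);
    return value != null ? Boolean.parseBoolean(value) : defaultValue;
  }
}
```
Callers would then use something like `SparkSQLConfs.getBoolean(SparkSQLConfs.VECTORIZATION_ENABLED, true)` instead of repeating the conf string and default at each call site.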
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]