shardulm94 commented on a change in pull request #2248:
URL: https://github.com/apache/iceberg/pull/2248#discussion_r580536744
##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,31 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
return false;
}
- public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+ public static boolean isVectorizationEnabled(FileFormat fileFormat,
+                                              Map<String, String> properties,
+                                              CaseInsensitiveStringMap readOptions) {
String batchReadsSessionConf = SparkSession.active().conf()
.get("spark.sql.iceberg.vectorization.enabled", null);
if (batchReadsSessionConf != null) {
return Boolean.valueOf(batchReadsSessionConf);
}
- return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED,
Review comment:
I see tradeoffs either way. I agree that the most specific value would ideally be the read options explicitly passed to the table read. But a session conf taking higher precedence is also convenient in production: vectorization can be turned off for an application with a pure config change, no code changes needed.
Another option we have is to use a boolean `AND` between the session conf and the read option. This is used in
https://github.com/apache/iceberg/blob/91ac42174e4c535ece4e36db2cb587a23babced9/spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L182
It can be a little confusing here that the default of the session conf (true) is different from the default of the read option (false), but it is worth considering.
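For illustration, a minimal sketch of that `AND`-style combination, assuming the existing `SparkReadOptions.VECTORIZATION_ENABLED` option and the true/false defaults mentioned above (the simplified signature and the defaults are only illustrative, not a proposed final API):
```
import org.apache.iceberg.spark.SparkReadOptions;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class VectorizationConfSketch {
  // Sketch only: combine the session conf and the read option with a boolean AND.
  // The session conf acts as an application-wide kill switch (assumed default: true),
  // while the read option opts a specific read in (assumed default: false).
  public static boolean isVectorizationEnabled(CaseInsensitiveStringMap readOptions) {
    boolean sessionEnabled = Boolean.parseBoolean(
        SparkSession.active().conf().get("spark.sql.iceberg.vectorization.enabled", "true"));
    boolean readOptionEnabled = readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, false);
    return sessionEnabled && readOptionEnabled;
  }
}
```
With this shape, the session conf can still disable vectorization globally, but it can no longer force it on for reads that did not ask for it.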
##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -338,12 +330,50 @@ public boolean enableBatchRead() {
boolean hasNoDeleteFiles = tasks().stream().noneMatch(TableScanUtil::hasDeletes);
+ boolean batchReadsEnabled = batchReadsEnabled(allParquetFileScanTasks, allOrcFileScanTasks);
+
this.readUsingBatch = batchReadsEnabled && hasNoDeleteFiles && (allOrcFileScanTasks ||
    (allParquetFileScanTasks && atLeastOneColumn && onlyPrimitives));
}
return readUsingBatch;
}
+ private boolean batchReadsEnabled(boolean isParquetOnly, boolean isOrcOnly) {
+ if (isParquetOnly) {
+ return isVectorizationEnabled(FileFormat.PARQUET);
+ } else if (isOrcOnly) {
+ return isVectorizationEnabled(FileFormat.ORC);
+ } else {
+ return false;
+ }
+ }
+
+ public boolean isVectorizationEnabled(FileFormat fileFormat) {
Review comment:
Agreed, it may also be good to factor out the Iceberg session confs into a class of their own along with the defaults. We have three right now:
```
spark.sql.iceberg.vectorization.enabled
spark.sql.iceberg.check-ordering
spark.sql.iceberg.check-nullability
```
It is also probably worth adding these configs to the documentation.
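A rough sketch of what such a holder class could look like (the class name, location, and helper below are hypothetical, just to illustrate keeping the conf names and defaults in one place):
```
import org.apache.spark.sql.SparkSession;

// Hypothetical holder for the Iceberg Spark session confs listed above.
public class SparkSQLConfs {

  private SparkSQLConfs() {
  }

  public static final String VECTORIZATION_ENABLED = "spark.sql.iceberg.vectorization.enabled";
  public static final String CHECK_ORDERING = "spark.sql.iceberg.check-ordering";
  public static final String CHECK_NULLABILITY = "spark.sql.iceberg.check-nullability";

  // Read a boolean session conf, falling back to the caller-supplied default when unset.
  public static boolean getBoolean(String confName, boolean defaultValue) {
    String value = SparkSession.active().conf().get(confName, null);
    return value != null ? Boolean.parseBoolean(value) : defaultValue;
  }
}
```
Callers would then use something like `SparkSQLConfs.getBoolean(SparkSQLConfs.VECTORIZATION_ENABLED, true)` instead of repeating the conf string and default at each call site.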
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]