[GitHub] [iceberg] shardulm94 commented on a change in pull request #2248: Spark: Fix isVectorizationEnabled in Spark3Util

GitBox Wed, 17 Feb 2021 16:02:30 -0800


shardulm94 commented on a change in pull request #2248:
URL: https://github.com/apache/iceberg/pull/2248#discussion_r578028187




##########
File path: spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java
##########
@@ -474,15 +475,28 @@ public static boolean isLocalityEnabled(FileIO io, String location, CaseInsensit
     return false;
   }
 
-  public static boolean isVectorizationEnabled(Map<String, String> properties, CaseInsensitiveStringMap readOptions) {
+  public static boolean isVectorizationEnabled(FileFormat fileFormat,
+                                               Map<String, String> properties,
+                                               CaseInsensitiveStringMap readOptions) {
     String batchReadsSessionConf = SparkSession.active().conf()
         .get("spark.sql.iceberg.vectorization.enabled", null);
     if (batchReadsSessionConf != null) {
       return Boolean.valueOf(batchReadsSessionConf);
     }
-    return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED,
-        PropertyUtil.propertyAsBoolean(properties,
-            TableProperties.PARQUET_VECTORIZATION_ENABLED, TableProperties.PARQUET_VECTORIZATION_ENABLED_DEFAULT));
+
+    switch (fileFormat) {
+      case PARQUET:
+        boolean defaultValue = PropertyUtil.propertyAsBoolean(
+            properties,
+            TableProperties.PARQUET_VECTORIZATION_ENABLED,
+            TableProperties.PARQUET_VECTORIZATION_ENABLED_DEFAULT);
+        return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, defaultValue);
+      case ORC:
+        // TODO: support a table property to enable/disable vectorized reads in ORC
+        return readOptions.getBoolean(SparkReadOptions.VECTORIZATION_ENABLED, true);

Review comment:
We can add one. My initial plan was to remove the Parquet-specific table
property and have a single generic table property for vectorization, but I
forgot about it. At LinkedIn, we pass it as a datasource option because we
have another layer above Iceberg where we set it.

The ORC vectorized reader supports all data types, so I don't see an issue
with vectorization being enabled by default. It won't work with delete files,
but I think we have checks elsewhere for that. Maybe we can keep it disabled
by default for backwards compatibility?
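
To make the idea of a single generic table property concrete, here is a
minimal sketch of the lookup order, not the code in this PR. It uses plain
java.util.Map instead of CaseInsensitiveStringMap (so keys are case-sensitive
here), takes the session conf value as a parameter instead of reading it from
SparkSession, and the key names are assumptions; in particular,
"read.vectorization.enabled" is a hypothetical generic property that does not
exist in TableProperties.

```java
import java.util.HashMap;
import java.util.Map;

class VectorizationSketch {
  // Assumed key for the per-read datasource option.
  static final String READ_OPTION = "vectorization-enabled";
  // Hypothetical generic table property, not an existing Iceberg property.
  static final String GENERIC_TABLE_PROP = "read.vectorization.enabled";

  // Precedence: session conf, then read option, then table property, then default.
  static boolean isVectorizationEnabled(String sessionConf,
                                        Map<String, String> tableProps,
                                        Map<String, String> readOptions,
                                        boolean defaultValue) {
    if (sessionConf != null) {
      return Boolean.parseBoolean(sessionConf);
    }
    String readOption = readOptions.get(READ_OPTION);
    if (readOption != null) {
      return Boolean.parseBoolean(readOption);
    }
    String tableProp = tableProps.get(GENERIC_TABLE_PROP);
    if (tableProp != null) {
      return Boolean.parseBoolean(tableProp);
    }
    return defaultValue;
  }

  public static void main(String[] args) {
    Map<String, String> tableProps = new HashMap<>();
    Map<String, String> readOptions = new HashMap<>();
    tableProps.put(GENERIC_TABLE_PROP, "true");

    // Table property applies when nothing more specific is set.
    System.out.println(isVectorizationEnabled(null, tableProps, readOptions, false));

    // A datasource option overrides the table property.
    readOptions.put(READ_OPTION, "false");
    System.out.println(isVectorizationEnabled(null, tableProps, readOptions, false));
  }
}
```

Passing false as defaultValue corresponds to the "disabled by default for
backwards compatibility" option; passing true matches what the ORC branch in
the diff does today.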




