rdblue commented on a change in pull request #828:
URL: https://github.com/apache/iceberg/pull/828#discussion_r434906760
##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -145,6 +154,14 @@
this.io = io;
this.encryptionManager = encryptionManager;
this.caseSensitive = caseSensitive;
+
+    boolean enableBatchReadsConfig =
+        options.get("iceberg.read.parquet-vectorization.enabled")
+            .map(Boolean::parseBoolean)
+            .orElse(true);
+    if (!enableBatchReadsConfig) {
+      enableBatchRead = Boolean.FALSE;
+    }
+    Optional<String> numRecordsPerBatchOpt =
+        options.get("iceberg.read.parquet-vectorization.batch-size");
Review comment:
Options passed into the read here should be short options because they
come from the DataFrameReader interface:
```scala
spark.read.format("iceberg").option("snapshot-id", snapId).load("db.table")
```
Long option names like the ones you have here are what we use for table
properties, which are tracked in `TableProperties` and documented. These read
options should default to the table property value, but a DataFrameReader
option should override it.
For short names, how about `vectorization-enabled` and `batch-size`? We
should also add constants in `TableProperties` for the properties you have
here. Let's turn off vectorization by default.
Also, we can drop "iceberg" from the start of the properties.
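To illustrate the resolution order described above, here is a minimal, self-contained sketch: a short DataFrameReader option (`vectorization-enabled`) overrides the table property (`read.parquet-vectorization.enabled`), which is off by default. The method name and the exact property strings are illustrative assumptions, not the final constants that would live in `TableProperties`:

```java
import java.util.Map;
import java.util.Optional;

public class OptionResolution {
  // Hypothetical helper: the short read option wins if present; otherwise
  // fall back to the table property; otherwise default to false (vectorization off).
  static boolean vectorizationEnabled(Map<String, String> readOptions,
                                      Map<String, String> tableProperties) {
    return Optional.ofNullable(readOptions.get("vectorization-enabled"))
        .map(Boolean::parseBoolean)
        .orElseGet(() -> Boolean.parseBoolean(
            tableProperties.getOrDefault("read.parquet-vectorization.enabled", "false")));
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of("read.parquet-vectorization.enabled", "true");
    // Table property alone enables it.
    System.out.println(vectorizationEnabled(Map.of(), props));
    // Read option overrides the table property.
    System.out.println(vectorizationEnabled(Map.of("vectorization-enabled", "false"), props));
    // Neither set: off by default.
    System.out.println(vectorizationEnabled(Map.of(), Map.of()));
  }
}
```

The same pattern would apply to `batch-size`, with the read option overriding the corresponding table property.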
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]