Re: [PR] feat: Add specific configs for converting Spark Parquet and JSON data to Arrow [datafusion-comet]

via GitHub Thu, 15 Aug 2024 10:14:14 -0700


andygrove commented on code in PR #832:
URL: https://github.com/apache/datafusion-comet/pull/832#discussion_r1718733137



##########
common/src/main/scala/org/apache/comet/CometConf.scala:
##########
@@ -84,15 +84,33 @@ object CometConf extends ShimCometConf {
     .booleanConf
     .createWithDefault(sys.env.getOrElse("ENABLE_COMET", "true").toBoolean)
 
-  val COMET_SCAN_ENABLED: ConfigEntry[Boolean] = conf("spark.comet.scan.enabled")
+  val COMET_NATIVE_SCAN_ENABLED: ConfigEntry[Boolean] = conf("spark.comet.scan.enabled")
     .doc(
-      "Whether to enable Comet scan. When this is turned on, Spark will use Comet to read " +
-        "Parquet data source. Note that to enable native vectorized execution, both this " +
-        "config and 'spark.comet.exec.enabled' need to be enabled. By default, this config " +
-        "is true.")
+      "Whether to enable native scans. When this is turned on, Spark will use Comet to " +
+        "read supported data sources (currently only Parquet is supported natively). Note " +
+        "that to enable native vectorized execution, both this config and " +
+        "'spark.comet.exec.enabled' need to be enabled. By default, this config is true.")
     .booleanConf
     .createWithDefault(true)
 
+  val COMET_CONVERT_FROM_PARQUET_ENABLED: ConfigEntry[Boolean] =
+    conf("spark.comet.convert.parquet.enabled")
+      .doc(
+        "When enabled, data from Parquet v1 and v2 scans will be converted to Arrow format. Note " +
+          "that to enable native vectorized execution, both this config and " +
+          "'spark.comet.exec.enabled' need to be enabled.")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   I added this in the docs:
   
   ```
   ## Parquet
   
   When `spark.comet.scan.enabled` is enabled, Parquet scans will be performed natively by Comet if all data types
   in the schema are supported. When this option is not enabled, the scan will fall back to Spark. In this case,
   enabling `spark.comet.convert.parquet.enabled` will immediately convert the data into Arrow format, allowing native
   execution to happen after that, but the process may not be efficient.
   ```
   
   I'll take another pass at the config description, though, to make it more detailed.
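   
   To make the interaction between the two options concrete, here is a rough
   sketch of the decision as I read the docs text. It is a standalone model, not
   Comet's planner code: `ParquetScanPath`, `ScanPathSketch`, and `choose` are
   hypothetical names. The docs do not say what happens when the scan is enabled
   but the schema has unsupported types; the sketch assumes that case also falls
   back to Spark and is then subject to the conversion option.
   
   ```scala
   // Hypothetical model of the documented behavior; these types do not exist in Comet.
   sealed trait ParquetScanPath
   case object NativeCometScan extends ParquetScanPath    // Comet reads Parquet natively
   case object SparkScanThenArrow extends ParquetScanPath // Spark reads, output converted to Arrow
   case object SparkScanOnly extends ParquetScanPath      // plain Spark scan, no conversion
   
   object ScanPathSketch {
     // scanEnabled           ~ spark.comet.scan.enabled
     // convertParquetEnabled ~ spark.comet.convert.parquet.enabled
     // allTypesSupported     ~ "all data types in the schema are supported"
     def choose(
         scanEnabled: Boolean,
         convertParquetEnabled: Boolean,
         allTypesSupported: Boolean): ParquetScanPath = {
       if (scanEnabled && allTypesSupported) NativeCometScan
       // Assumption: an enabled scan with unsupported types is treated like a disabled scan.
       else if (convertParquetEnabled) SparkScanThenArrow
       else SparkScanOnly
     }
   
     def main(args: Array[String]): Unit = {
       // Print the chosen path for every combination of the two config flags.
       for (scan <- Seq(true, false); convert <- Seq(true, false)) {
         val path = choose(scan, convert, allTypesSupported = true)
         println(s"scan.enabled=$scan convert.parquet.enabled=$convert -> $path")
       }
     }
   }
   ```
   
   Only the `SparkScanThenArrow` path pays the extra conversion cost, which is
   why the docs note that it may not be efficient.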



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


