This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch comet-parquet-exec
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/comet-parquet-exec by this push:
     new facda24b9 chore: [comet-parquet-exec] enable native scan by default (again) (#1302)
facda24b9 is described below

commit facda24b9107089c6f4f18e397bd11f20fdbe9ad
Author: Andy Grove <agr...@apache.org>
AuthorDate: Tue Jan 21 11:02:30 2025 -0700

    chore: [comet-parquet-exec] enable native scan by default (again) (#1302)

    * fix regression

    * enable native scan by default

    * experiment

    * Revert "experiment"

    This reverts commit e05a625a2afa1d0212a58ce9ffb00cab8a679c0d.

    * revert change to exception handling
---
 common/src/main/scala/org/apache/comet/CometConf.scala    |  2 +-
 docs/source/user-guide/configs.md                          |  2 +-
 .../apache/comet/parquet/CometParquetFileFormat.scala      | 16 ++++++++++++++--
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/common/src/main/scala/org/apache/comet/CometConf.scala b/common/src/main/scala/org/apache/comet/CometConf.scala
index cdfeb5771..2e64403c6 100644
--- a/common/src/main/scala/org/apache/comet/CometConf.scala
+++ b/common/src/main/scala/org/apache/comet/CometConf.scala
@@ -75,7 +75,7 @@ object CometConf extends ShimCometConf {
         "that to enable native vectorized execution, both this config and " +
         "'spark.comet.exec.enabled' need to be enabled.")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
 
   val SCAN_NATIVE_COMET = "native_comet"
   val SCAN_NATIVE_DATAFUSION = "native_datafusion"
diff --git a/docs/source/user-guide/configs.md b/docs/source/user-guide/configs.md
index 6322d4bf6..2bea501e5 100644
--- a/docs/source/user-guide/configs.md
+++ b/docs/source/user-guide/configs.md
@@ -74,7 +74,7 @@ Comet provides the following configuration settings.
 | spark.comet.parquet.read.parallel.io.enabled | Whether to enable Comet's parallel reader for Parquet files. The parallel reader reads ranges of consecutive data in a file in parallel. It is faster for large files and row groups but uses more resources. | true |
 | spark.comet.parquet.read.parallel.io.thread-pool.size | The maximum number of parallel threads the parallel reader will use in a single executor. For executors configured with a smaller number of cores, use a smaller number. | 16 |
 | spark.comet.regexp.allowIncompatible | Comet is not currently fully compatible with Spark for all regular expressions. Set this config to true to allow them anyway using Rust's regular expression engine. See compatibility guide for more information. | false |
-| spark.comet.scan.enabled | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | false |
+| spark.comet.scan.enabled | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and 'spark.comet.exec.enabled' need to be enabled. | true |
 | spark.comet.scan.preFetch.enabled | Whether to enable pre-fetching feature of CometScan. | false |
 | spark.comet.scan.preFetch.threadNum | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |
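The hunks above flip the default of spark.comet.scan.enabled from false to true. As a minimal sketch (not part of this commit; the object and application names below are made up for illustration), an application that wants the previous behaviour could still opt out of the native scan explicitly when building its SparkSession:

import org.apache.spark.sql.SparkSession

object CometScanToggleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("comet-scan-toggle-example")
      // The native scan is now on by default; set it to false to revert
      // to the pre-commit behaviour. Per the config docs above, native
      // vectorized execution also requires spark.comet.exec.enabled=true.
      .config("spark.comet.scan.enabled", "false")
      .getOrCreate()

    println(spark.conf.get("spark.comet.scan.enabled")) // prints: false
    spark.stop()
  }
}

The same key can equally be passed on the command line via --conf spark.comet.scan.enabled=false; only applications that want to opt out need to set anything now.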
 | spark.comet.shuffle.preferDictionary.ratio | The ratio of total values to distinct values in a string column to decide whether to prefer dictionary encoding when shuffling the column. If the ratio is higher than this config, dictionary encoding will be used on shuffling string column. This config is effective if it is higher than 1.0. Note that this config is only used when `spark.comet.exec.shuffle.mode` is `jvm`. | 10.0 |
diff --git a/spark/src/main/scala/org/apache/comet/parquet/CometParquetFileFormat.scala b/spark/src/main/scala/org/apache/comet/parquet/CometParquetFileFormat.scala
index b6a511a5b..b97aff1b1 100644
--- a/spark/src/main/scala/org/apache/comet/parquet/CometParquetFileFormat.scala
+++ b/spark/src/main/scala/org/apache/comet/parquet/CometParquetFileFormat.scala
@@ -151,7 +151,13 @@ class CometParquetFileFormat extends ParquetFileFormat with MetricsSupport with
         partitionSchema,
         file.partitionValues,
         JavaConverters.mapAsJavaMap(metrics))
-      batchReader.init()
+      try {
+        batchReader.init()
+      } catch {
+        case e: Throwable =>
+          batchReader.close()
+          throw e
+      }
       batchReader
     } else {
       val batchReader = new BatchReader(
@@ -167,7 +173,13 @@ class CometParquetFileFormat extends ParquetFileFormat with MetricsSupport with
         partitionSchema,
         file.partitionValues,
         JavaConverters.mapAsJavaMap(metrics))
-      batchReader.init()
+      try {
+        batchReader.init()
+      } catch {
+        case e: Throwable =>
+          batchReader.close()
+          throw e
+      }
       batchReader
     }
     val iter = new RecordReaderIterator(recordBatchReader)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org
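The CometParquetFileFormat.scala hunks above wrap batchReader.init() so that a reader whose initialization throws is closed before the exception is rethrown, rather than being leaked. A minimal, self-contained sketch of that close-on-failure pattern follows; Reader and initOrClose here are illustrative stand-ins, not Comet's NativeBatchReader/BatchReader API:

import java.io.Closeable

// Stand-in resource whose init() fails, to show the cleanup path.
final class Reader extends Closeable {
  def init(): Unit = throw new RuntimeException("init failed") // simulate an init failure
  override def close(): Unit = println("reader closed")
}

object InitOrCloseExample {
  // Run init on the resource; if it throws, close the resource first,
  // then rethrow so the caller still sees the original failure.
  def initOrClose[R <: Closeable](reader: R)(init: R => Unit): R =
    try {
      init(reader)
      reader
    } catch {
      case e: Throwable =>
        reader.close()
        throw e
    }

  def main(args: Array[String]): Unit =
    try initOrClose(new Reader)(_.init())
    catch { case _: RuntimeException => println("init failure propagated") }
}

The commit applies the same shape inline at both BatchReader construction sites rather than through a shared helper like this one.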