(datafusion-comet) branch comet-parquet-exec updated: chore: [comet-parquet-exec] merge from main 20240116 (#1299)

agrove Sat, 18 Jan 2025 09:15:57 -0800

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch comet-parquet-exec
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git



The following commit(s) were added to refs/heads/comet-parquet-exec by this push:
     new c17a0f607 chore: [comet-parquet-exec] merge from main 20240116 (#1299)
c17a0f607 is described below

commit c17a0f607d52164b57033a4cc054a1b5240a0d73
Author: Parth Chandra <[email protected]>
AuthorDate: Sat Jan 18 08:02:28 2025 -0800

    chore: [comet-parquet-exec] merge from main 20240116 (#1299)
    
    * feat: support array_append (#1072)
    
    * feat: support array_append
    
    * formatted code
    
    * rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde
    
    * remove unwrap
    
    * Fix for Spark 3.3
    
    * refactor array_append binary expression serde code
    
    * Disabled array_append test for spark 4.0+
    
    * chore: Simplify CometShuffleMemoryAllocator to use Spark unified memory allocator (#1063)
    
    * docs: Update benchmarking.md (#1085)
    
    * feat: Require offHeap memory to be enabled (always use unified memory) (#1062)
    
    * Require offHeap memory
    
    * remove unused import
    
    * use off heap memory in stability tests
    
    * reorder imports
    
    * test: Restore one test in CometExecSuite by adding COMET_SHUFFLE_MODE config (#1087)
    
    * Add changelog for 0.4.0 (#1089)
    
    * chore: Prepare for 0.5.0 development (#1090)
    
    * Update version number for build
    
    * update docs
    
    * build: Skip installation of spark-integration and fuzz testing modules (#1091)
    
    * Add hint for finding the GPG key to use when publishing to maven (#1093)
    
    * docs: Update documentation for 0.4.0 release (#1096)
    
    * update TPC-H results
    
    * update Maven links
    
    * update benchmarking guide and add TPC-DS results
    
    * include q72
    
    * fix: Unsigned type related bugs (#1095)
    
    ## Which issue does this PR close?
    
    Closes https://github.com/apache/datafusion-comet/issues/1067
    
    ## Rationale for this change
    
    Bug fix. A few expressions were failing some unsigned type related tests
    
    ## What changes are included in this PR?
    
     - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
     - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()` (`>` vs `>=`)
    
    ## How are these changes tested?
    
    Put back tests for unsigned types
    
    * chore: Include first ScanExec batch in metrics (#1105)
    
    * include first batch in ScanExec metrics
    
    * record row count metric
    
    * fix regression
    
    * chore: Improve CometScan metrics (#1100)
    
    * Add native metrics for plan creation
    
    * make messages consistent
    
    * Include get_next_batch cost in metrics
    
    * formatting
    
    * fix double count of rows
    
    * chore: Add custom metric for native shuffle fetching batches from JVM (#1108)
    
    * feat: support array_insert (#1073)
    
    * Part of the implementation of array_insert
    
    * Missing methods
    
    * Working version
    
    * Reformat code
    
    * Fix code-style
    
    * Add comments about spark's implementation.
    
    * Implement negative indices
    
    + fix tests for spark < 3.4
    
    * Fix code-style
    
    * Fix scalastyle
    
    * Fix tests for spark < 3.4
    
    * Fixes & tests
    
    - added test for the negative index
    - added test for the legacy spark mode
    
    * Use assume(isSpark34Plus) in tests
    
    * Test else-branch & improve coverage
    
    * Update native/spark-expr/src/list.rs
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * Fix fallback test
    
    In one case there is a zero in index and test fails due to spark error
    
    * Adjust the behaviour for the NULL case to Spark
    
    * Move the logic of type checking to the method
    
    * Fix code-style
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * feat: enable decimal to decimal cast of different precision and scale (#1086)
    
    * enable decimal to decimal cast of different precision and scale
    
    * add more test cases for negative scale and higher precision
    
    * add check for compatibility for decimal to decimal
    
    * fix code style
    
    * Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * fix the nit in comment
    
    ---------
    
    Co-authored-by: himadripal <[email protected]>
    Co-authored-by: Andy Grove <[email protected]>
    
    * docs: fix readme FGPA/FPGA typo (#1117)
    
    * fix: Use RDD partition index (#1112)
    
    * fix: Use RDD partition index
    
    * fix
    
    * fix
    
    * fix
    
    * fix: Various metrics bug fixes and improvements (#1111)
    
    * fix: Don't create CometScanExec for subclasses of ParquetFileFormat (#1129)
    
    * Use exact class comparison for parquet scan
    
    * Add test
    
    * Add comment
    
    * fix: Fix metrics regressions (#1132)
    
    * fix metrics issues
    
    * clippy
    
    * update tests
    
    * docs: Add more technical detail and new diagram to Comet plugin overview (#1119)
    
    * Add more technical detail and new diagram to Comet plugin overview
    
    * update diagram
    
    * add info on Arrow IPC
    
    * update diagram
    
    * update diagram
    
    * update docs
    
    * address feedback
    
    * Stop passing Java config map into native createPlan (#1101)
    
    * feat: Improve ScanExec native metrics (#1133)
    
    * save
    
    * remove shuffle jvm metric and update tuning guide
    
    * docs
    
    * add source for all ScanExecs
    
    * address feedback
    
    * address feedback
    
    * chore: Remove unused StringView struct (#1143)
    
    * Remove unused StringView struct
    
    * remove more dead code
    
    * docs: Add some documentation explaining how shuffle works (#1148)
    
    * add some notes on shuffle
    
    * reads
    
    * improve docs
    
    * test: enable more Spark 4.0 tests (#1145)
    
    ## Which issue does this PR close?
    
    Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551
    
    ## Rationale for this change
    
    To be ready for Spark 4.0
    
    ## What changes are included in this PR?
    
    This PR enables more Spark 4.0 tests that were fixed by recent changes
    
    ## How are these changes tested?
    
    tests enabled
    
    * chore: Refactor cast to use SparkCastOptions param (#1146)
    
    * Refactor cast to use SparkCastOptions param
    
    * update tests
    
    * update benches
    
    * update benches
    
    * update benches
    
    * Enable more scenarios in CometExecBenchmark. (#1151)
    
    * chore: Move more expressions from core crate to spark-expr crate (#1152)
    
    * move aggregate expressions to spark-expr crate
    
    * move more expressions
    
    * move benchmark
    
    * normalize_nan
    
    * bitwise not
    
    * comet scalar funcs
    
    * update bench imports
    
    * remove dead code (#1155)
    
    * fix: Spark 4.0-preview1 SPARK-47120 (#1156)
    
    ## Which issue does this PR close?
    
    Part of https://github.com/apache/datafusion-comet/issues/372 and https://github.com/apache/datafusion-comet/issues/551
    
    ## Rationale for this change
    
    To be ready for Spark 4.0
    
    ## What changes are included in this PR?
    
    This PR fixes the new test SPARK-47120 added in Spark 4.0
    
    ## How are these changes tested?
    
    tests enabled
    
    * chore: Move string kernels and expressions to spark-expr crate (#1164)
    
    * Move string kernels and expressions to spark-expr crate
    
    * remove unused hash kernel
    
    * remove unused dependencies
    
    * chore: Move remaining expressions to spark-expr crate + some minor refactoring (#1165)
    
    * move CheckOverflow to spark-expr crate
    
    * move NegativeExpr to spark-expr crate
    
    * move UnboundColumn to spark-expr crate
    
    * move ExpandExec from execution::datafusion::operators to execution::operators
    
    * refactoring to remove datafusion subpackage
    
    * update imports in benches
    
    * fix
    
    * fix
    
    * chore: Add ignored tests for reading complex types from Parquet (#1167)
    
    * Add ignored tests for reading structs from Parquet
    
    * add basic map test
    
    * add tests for Map and Array
    
    * feat: Add Spark-compatible implementation of SchemaAdapterFactory (#1169)
    
    * Add Spark-compatible SchemaAdapterFactory implementation
    
    * remove prototype code
    
    * fix
    
    * refactor
    
    * implement more cast logic
    
    * implement more cast logic
    
    * add basic test
    
    * improve test
    
    * cleanup
    
    * fmt
    
    * add support for casting unsigned int to signed int
    
    * clippy
    
    * address feedback
    
    * fix test
    
    * fix: Document enabling comet explain plan usage in Spark (4.0) (#1176)
    
    * test: enabling Spark tests with offHeap requirement (#1177)
    
    ## Which issue does this PR close?
    
    ## Rationale for this change
    
    After https://github.com/apache/datafusion-comet/pull/1062, we have not been running Spark tests for native execution
    
    ## What changes are included in this PR?
    
    Removed the off heap requirement for testing
    
    ## How are these changes tested?
    
    Bringing back Spark tests for native execution
    
    * feat: Improve shuffle metrics (second attempt) (#1175)
    
    * improve shuffle metrics
    
    * docs
    
    * more metrics
    
    * refactor
    
    * address feedback
    
    * fix: stddev_pop should not directly return 0.0 when count is 1.0 (#1184)
    
    * add test
    
    * fix
    
    * fix
    
    * fix
    
    * feat: Make native shuffle compression configurable and respect `spark.shuffle.compress` (#1185)
    
    * Make shuffle compression codec and level configurable
    
    * remove lz4 references
    
    * docs
    
    * update comment
    
    * clippy
    
    * fix benches
    
    * clippy
    
    * clippy
    
    * disable test for miri
    
    * remove lz4 reference from proto
    
    * minor: move shuffle classes from common to spark (#1193)
    
    * minor: refactor decodeBatches to make private in broadcast exchange (#1195)
    
    * minor: refactor prepare_output so that it does not require an ExecutionContext (#1194)
    
    * fix: fix missing explanation for then branch in case when (#1200)
    
    * minor: remove unused source files (#1202)
    
    * chore: Upgrade to DataFusion 44.0.0-rc2 (#1154)
    
    * move aggregate expressions to spark-expr crate
    
    * move more expressions
    
    * move benchmark
    
    * normalize_nan
    
    * bitwise not
    
    * comet scalar funcs
    
    * update bench imports
    
    * save
    
    * save
    
    * save
    
    * remove unused imports
    
    * clippy
    
    * implement more hashers
    
    * implement Hash and PartialEq
    
    * implement Hash and PartialEq
    
    * implement Hash and PartialEq
    
    * benches
    
    * fix ScalarUDFImpl.return_type failure
    
    * exclude test from miri
    
    * ignore correct test
    
    * ignore another test
    
    * remove miri checks
    
    * use return_type_from_exprs
    
    * Revert "use return_type_from_exprs"
    
    This reverts commit febc1f1ec1301f9b359fc23ad6a117224fce35b7.
    
    * use DF main branch
    
    * hacky workaround for regression in ScalarUDFImpl.return_type
    
    * fix repo url
    
    * pin to revision
    
    * bump to latest rev
    
    * bump to latest DF rev
    
    * bump DF to rev 9f530dd
    
    * add Cargo.lock
    
    * bump DF version
    
    * no default features
    
    * Revert "remove miri checks"
    
    This reverts commit 4638fe3aa5501966cd5d8b53acf26c698b10b3c9.
    
    * Update pin to DataFusion e99e02b9b9093ceb0c13a2dd32a2a89beba47930
    
    * update pin
    
    * Update Cargo.toml
    
    Bump to 44.0.0-rc2
    
    * update cargo lock
    
    * revert miri change
    
    ---------
    
    Co-authored-by: Andrew Lamb <[email protected]>
    
    * feat: add support for array_contains expression (#1163)
    
    * feat: add support for array_contains expression
    
    * test: add unit test for array_contains function
    
    * Removes unnecessary case expression for handling null values
    
    * update UT
    
    Signed-off-by: Dharan Aditya <[email protected]>
    
    * fix typo in UT
    
    Signed-off-by: Dharan Aditya <[email protected]>
    
    ---------
    
    Signed-off-by: Dharan Aditya <[email protected]>
    Co-authored-by: Andy Grove <[email protected]>
    Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
    Co-authored-by: Parth Chandra <[email protected]>
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    Co-authored-by: Raz Luvaton <[email protected]>
    Co-authored-by: Andrew Lamb <[email protected]>
    
    * feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. (#1021)
    
    * feat: Reenable tests for filtered SMJ anti join (#1211)
    
    * feat: reenable filtered SMJ Anti join tests
    
    * feat: reenable filtered SMJ Anti join tests
    
    * feat: reenable filtered SMJ Anti join tests
    
    * feat: reenable filtered SMJ Anti join tests
    
    * Add CoalesceBatchesExec around SMJ with join filter
    
    * adding `CoalesceBatches`
    
    * adding `CoalesceBatches`
    
    * adding `CoalesceBatches`
    
    * feat: reenable filtered SMJ Anti join tests
    
    * feat: reenable filtered SMJ Anti join tests
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * chore: Add safety check to CometBuffer (#1050)
    
    * chore: Add safety check to CometBuffer
    
    * Add CometColumnarToRowExec
    
    * fix
    
    * fix
    
    * more
    
    * Update plan stability results
    
    * fix
    
    * fix
    
    * fix
    
    * Revert "fix"
    
    This reverts commit 9bad173c7751f105bf3ded2ebc2fed0737d1b909.
    
    * Revert "Revert "fix""
    
    This reverts commit d527ad1a365d3aff64200ceba6d11cf376f3919f.
    
    * fix BucketedReadWithoutHiveSupportSuite
    
    * fix SparkPlanSuite
    
    * remove unreachable code (#1213)
    
    * test: Enable Comet by default except some tests in SparkSessionExtensionSuite (#1201)
    
    ## Which issue does this PR close?
    
    Part of https://github.com/apache/datafusion-comet/issues/1197
    
    ## Rationale for this change
    
    Since `loadCometExtension` in the diffs was not using `isCometEnabled`, `SparkSessionExtensionSuite` was not using Comet. Once it was enabled, some test failures were discovered
    
    ## What changes are included in this PR?
    
    `loadCometExtension` now uses `isCometEnabled` that enables Comet by default
    Temporarily ignore the failing tests in SparkSessionExtensionSuite
    
    ## How are these changes tested?
    
    existing tests
    
    * extract struct expressions to folders based on spark grouping (#1216)
    
    * chore: extract static invoke expressions to folders based on spark grouping (#1217)
    
    * extract static invoke expressions to folders based on spark grouping
    
    * Update native/spark-expr/src/static_invoke/mod.rs
    
    Co-authored-by: Andy Grove <[email protected]>
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * chore: Follow-on PR to fully enable onheap memory usage (#1210)
    
    * Make datafusion's native memory pool configurable
    
    * save
    
    * fix
    
    * Update memory calculation and add draft documentation
    
    * ready for review
    
    * ready for review
    
    * address feedback
    
    * Update docs/source/user-guide/tuning.md
    
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * Update docs/source/user-guide/tuning.md
    
    Co-authored-by: Kristin Cowalcijk <[email protected]>
    
    * Update docs/source/user-guide/tuning.md
    
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * Update docs/source/user-guide/tuning.md
    
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * remove unused config
    
    ---------
    
    Co-authored-by: Kristin Cowalcijk <[email protected]>
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support (#1192)
    
    * Implement native decoding and decompression
    
    * revert some variable renaming for smaller diff
    
    * fix oom issues?
    
    * make NativeBatchDecoderIterator more consistent with ArrowReaderIterator
    
    * fix oom and prep for review
    
    * format
    
    * Add LZ4 support
    
    * clippy, new benchmark
    
    * rename metrics, clean up lz4 code
    
    * update test
    
    * Add support for snappy
    
    * format
    
    * change default back to lz4
    
    * make metrics more accurate
    
    * format
    
    * clippy
    
    * use faster unsafe version of lz4_flex
    
    * Make compression codec configurable for columnar shuffle
    
    * clippy
    
    * fix bench
    
    * fmt
    
    * address feedback
    
    * address feedback
    
    * address feedback
    
    * minor code simplification
    
    * cargo fmt
    
    * overflow check
    
    * rename compression level config
    
    * address feedback
    
    * address feedback
    
    * rename constant
    
    * chore: extract agg_funcs expressions to folders based on spark grouping (#1224)
    
    * extract agg_funcs expressions to folders based on spark grouping
    
    * fix rebase
    
    * extract datetime_funcs expressions to folders based on spark grouping (#1222)
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * chore: use datafusion from crates.io (#1232)
    
    * chore: extract strings file to `strings_func` like in spark grouping (#1215)
    
    * chore: extract predicate_functions expressions to folders based on spark grouping (#1218)
    
    * extract predicate_functions expressions to folders based on spark grouping
    
    * code review changes
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * build(deps): bump protobuf version to 3.21.12 (#1234)
    
    * extract json_funcs expressions to folders based on spark grouping (#1220)
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * test: Enable shuffle by default in Spark tests (#1240)
    
    ## Which issue does this PR close?
    
    ## Rationale for this change
    
    Because `isCometShuffleEnabled` is false by default, some tests were not reached
    
    ## What changes are included in this PR?
    
    Removed `isCometShuffleEnabled` and updated spark test diff
    
    ## How are these changes tested?
    
    existing test
    
    * chore: extract hash_funcs expressions to folders based on spark grouping (#1221)
    
    * extract hash_funcs expressions to folders based on spark grouping
    
    * extract hash_funcs expressions to folders based on spark grouping
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates (#1253)
    
    * perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported (#1209)
    
    * fix regression (#1259)
    
    * feat: add support for array_remove expression (#1179)
    
    * wip: array remove
    
    * added comet expression test
    
    * updated test cases
    
    * fixed array_remove function for null values
    
    * removed commented code
    
    * remove unnecessary code
    
    * updated the test for 'array_remove'
    
    * added test for array_remove in case the input array is null
    
    * wip: case array is empty
    
    * removed test case for empty array
    
    * fix: Fall back to Spark for distinct aggregates (#1262)
    
    * fall back to Spark for distinct aggregates
    
    * update expected plans for 3.4
    
    * update expected plans for 3.5
    
    * force build
    
    * add comment
    
    * feat: Implement custom RecordBatch serde for shuffle for improved performance (#1190)
    
    * Implement faster encoder for shuffle blocks
    
    * make code more concise
    
    * enable fast encoding for columnar shuffle
    
    * update benches
    
    * test all int types
    
    * test float
    
    * remaining types
    
    * add Snappy and Zstd(6) back to benchmark
    
    * fix regression
    
    * Update native/core/src/execution/shuffle/codec.rs
    
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * address feedback
    
    * support nullable flag
    
    ---------
    
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    
    * docs: Update TPC-H benchmark results (#1257)
    
    * fix: disable initCap by default (#1276)
    
    * fix: disable initCap by default
    
    * Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * address review comments
    
    ---------
    
    Co-authored-by: Andy Grove <[email protected]>
    
    * chore: Add changelog for 0.5.0 (#1278)
    
    * Add changelog
    
    * revert accidental change
    
    * move 2 items to performance section
    
    * update TPC-DS results for 0.5.0 (#1277)
    
    * fix: cast timestamp to decimal is unsupported (#1281)
    
    * fix: cast timestamp to decimal is unsupported
    
    * fix style
    
    * revert test name and mark as ignore
    
    * add comment
    
    * chore: Start 0.6.0 development (#1286)
    
    * start 0.6.0 development
    
    * update some docs
    
    * Revert a change
    
    * update CI
    
    * docs: Fix links and provide complete benchmarking scripts (#1284)
    
    * fix links and provide complete scripts
    
    * fix path
    
    * fix incorrect text
    
    * feat: Add HasRowIdMapping interface (#1288)
    
    * fix style
    
    * fix
    
    * fix for plan serialization
    
    ---------
    
    Signed-off-by: Dharan Aditya <[email protected]>
    Co-authored-by: NoeB <[email protected]>
    Co-authored-by: Liang-Chi Hsieh <[email protected]>
    Co-authored-by: Raz Luvaton <[email protected]>
    Co-authored-by: Andy Grove <[email protected]>
    Co-authored-by: KAZUYUKI TANIMURA <[email protected]>
    Co-authored-by: Sem <[email protected]>
    Co-authored-by: Himadri Pal <[email protected]>
    Co-authored-by: himadripal <[email protected]>
    Co-authored-by: gstvg <[email protected]>
    Co-authored-by: Adam Binford <[email protected]>
    Co-authored-by: Matt Butrovich <[email protected]>
    Co-authored-by: Raz Luvaton <[email protected]>
    Co-authored-by: Andrew Lamb <[email protected]>
    Co-authored-by: Dharan Aditya <[email protected]>
    Co-authored-by: Kristin Cowalcijk <[email protected]>
    Co-authored-by: Oleks V <[email protected]>
    Co-authored-by: Zhen Wang <[email protected]>
    Co-authored-by: Jagdish Parihar <[email protected]>
---
 .github/workflows/spark_sql_test.yml               |  2 +-
 .github/workflows/spark_sql_test_ansi.yml          |  2 +-
 common/pom.xml                                     |  2 +-
 .../org/apache/comet/vector/HasRowIdMapping.java   | 39 ++++++++++++
 .../contributor-guide/benchmark-results/tpc-ds.md  | 65 +++++++++++++++++++-
 .../contributor-guide/benchmark-results/tpc-h.md   | 71 ++++++++++++++++++++--
 docs/source/contributor-guide/benchmarking.md      | 56 -----------------
 docs/source/contributor-guide/debugging.md         |  2 +-
 docs/source/user-guide/installation.md             |  4 +-
 fuzz-testing/pom.xml                               |  2 +-
 native/Cargo.lock                                  |  6 +-
 native/Cargo.toml                                  |  6 +-
 pom.xml                                            |  2 +-
 spark-integration/pom.xml                          |  2 +-
 spark/pom.xml                                      |  2 +-
 .../apache/comet/CometSparkSessionExtensions.scala | 15 ++++-
 16 files changed, 197 insertions(+), 81 deletions(-)

diff --git a/.github/workflows/spark_sql_test.yml b/.github/workflows/spark_sql_test.yml
index 477e3a1ab..238fbb271 100644
--- a/.github/workflows/spark_sql_test.yml
+++ b/.github/workflows/spark_sql_test.yml
@@ -71,7 +71,7 @@ jobs:
         with:
           spark-version: ${{ matrix.spark-version.full }}
           spark-short-version: ${{ matrix.spark-version.short }}
-          comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
+          comet-version: '0.6.0-SNAPSHOT' # TODO: get this from pom.xml
       - name: Run Spark tests
         run: |
           cd apache-spark
diff --git a/.github/workflows/spark_sql_test_ansi.yml b/.github/workflows/spark_sql_test_ansi.yml
index e1d8388fb..14ec6366f 100644
--- a/.github/workflows/spark_sql_test_ansi.yml
+++ b/.github/workflows/spark_sql_test_ansi.yml
@@ -69,7 +69,7 @@ jobs:
         with:
           spark-version: ${{ matrix.spark-version.full }}
           spark-short-version: ${{ matrix.spark-version.short }}
-          comet-version: '0.5.0-SNAPSHOT' # TODO: get this from pom.xml
+          comet-version: '0.6.0-SNAPSHOT' # TODO: get this from pom.xml
       - name: Run Spark tests
         run: |
           cd apache-spark
diff --git a/common/pom.xml b/common/pom.xml
index 91109edf5..b6cd75a32 100644
--- a/common/pom.xml
+++ b/common/pom.xml
@@ -26,7 +26,7 @@ under the License.
   <parent>
     <groupId>org.apache.datafusion</groupId>
     <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-    <version>0.5.0-SNAPSHOT</version>
+    <version>0.6.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 
diff --git a/common/src/main/java/org/apache/comet/vector/HasRowIdMapping.java b/common/src/main/java/org/apache/comet/vector/HasRowIdMapping.java
new file mode 100644
index 000000000..8794902b4
--- /dev/null
+++ b/common/src/main/java/org/apache/comet/vector/HasRowIdMapping.java
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.comet.vector;
+
+/**
+ * An interface could be implemented by vectors that have row id mapping.
+ *
+ * <p>For example, Iceberg's DeleteFile has a row id mapping to map row id to position. This
+ * interface is used to set and get the row id mapping. The row id mapping is an array of integers,
+ * where the index is the row id and the value is the position. Here is an example:
+ * [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array Position delete 2, 6
+ * [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
+ */
+public interface HasRowIdMapping {
+  default void setRowIdMapping(int[] rowIdMapping) {
+    throw new UnsupportedOperationException("setRowIdMapping is not supported");
+  }
+
+  default int[] getRowIdMapping() {
+    throw new UnsupportedOperationException("getRowIdMapping is not supported");
+  }
+}
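
The Javadoc example in `HasRowIdMapping.java` above can be worked through in a few lines of Java. This is only a sketch of how such a mapping array might be built, not the code Comet or Iceberg uses: the class `RowIdMappingSketch`, the helpers `applyPositionDeletes` and `liveCount`, and the use of `-1` for the unused tail slots (shown as `-` in the Javadoc) are all assumptions made for illustration. It also assumes the deleted positions are sorted in ascending order without duplicates.

```java
import java.util.Arrays;

public class RowIdMappingSketch {

    /**
     * Hypothetical helper: builds a row id mapping for numRows rows after
     * removing the given positions. Index = row id, value = original position.
     * Assumes deletedPositions is sorted ascending and has no duplicates.
     */
    static int[] applyPositionDeletes(int numRows, int[] deletedPositions) {
        int[] mapping = new int[numRows];
        // -1 marks unused tail slots; this marker is an assumption of the sketch.
        Arrays.fill(mapping, -1);
        int next = 0;    // next row id to assign
        int deleted = 0; // index into deletedPositions
        for (int pos = 0; pos < numRows; pos++) {
            if (deleted < deletedPositions.length && deletedPositions[deleted] == pos) {
                deleted++; // this position was deleted, so no row id points at it
                continue;
            }
            mapping[next++] = pos;
        }
        return mapping;
    }

    /** Hypothetical helper: number of rows that survive the deletes. */
    static int liveCount(int[] mapping) {
        int count = 0;
        for (int value : mapping) {
            if (value >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] mapping = applyPositionDeletes(8, new int[] {2, 6});
        System.out.println(Arrays.toString(mapping)); // [0, 1, 3, 4, 5, 7, -1, -1]
        System.out.println(liveCount(mapping));       // 6
    }
}
```

With positions 2 and 6 deleted from eight rows, row ids 0 through 5 map to positions 0, 1, 3, 4, 5 and 7, and the record count drops to 6, matching the Javadoc example.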
diff --git a/docs/source/contributor-guide/benchmark-results/tpc-ds.md b/docs/source/contributor-guide/benchmark-results/tpc-ds.md
index a6650f7e7..012913189 100644
--- a/docs/source/contributor-guide/benchmark-results/tpc-ds.md
+++ b/docs/source/contributor-guide/benchmark-results/tpc-ds.md
@@ -19,8 +19,8 @@ under the License.
 
 # Apache DataFusion Comet: Benchmarks Derived From TPC-DS
 
-The following benchmarks were performed on a two node Kubernetes cluster with
-data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments 
+The following benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and
+data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments
 and we encourage you to run these benchmarks in your own environments.
 
 The tracking issue for improving TPC-DS performance is [#858](https://github.com/apache/datafusion-comet/issues/858).
@@ -43,3 +43,64 @@ The raw results of these benchmarks in JSON format is available here:
 
 - [Spark](0.5.0/spark-tpcds.json)
 - [Comet](0.5.0/comet-tpcds.json)
+
+# Scripts
+
+Here are the scripts that were used to generate these results.
+
+## Apache Spark
+
+```shell
+#!/bin/bash
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.memory=32G \
+    --conf spark.executor.instances=2 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=16 \
+    --conf spark.eventLog.enabled=true \
+    tpcbench.py \
+    --benchmark tpcds \
+    --name spark \
+    --data /mnt/bigdata/tpcds/sf100/ \
+    --queries ../../tpcds/ \
+    --output . \
+    --iterations 5
+```
+
+## Apache Spark + Comet
+
+```shell
+#!/bin/bash
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.instances=2 \
+    --conf spark.executor.memory=16G \
+    --conf spark.executor.cores=8 \
+    --total-executor-cores=16 \
+    --conf spark.eventLog.enabled=true \
+    --conf spark.driver.maxResultSize=2G \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=24g \
+    --jars $COMET_JAR \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.plugins=org.apache.spark.CometPlugin \
+    --conf spark.comet.enabled=true \
+    --conf spark.comet.cast.allowIncompatible=true \
+    --conf spark.comet.exec.replaceSortMergeJoin=false \
+    --conf spark.comet.exec.shuffle.enabled=true \
+    --conf spark.comet.exec.shuffle.mode=auto \
+    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
+    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    tpcbench.py \
+    --name comet \
+    --benchmark tpcds \
+    --data /mnt/bigdata/tpcds/sf100/ \
+    --queries ../../tpcds/ \
+    --output . \
+    --iterations 5
+```
\ No newline at end of file
diff --git a/docs/source/contributor-guide/benchmark-results/tpc-h.md b/docs/source/contributor-guide/benchmark-results/tpc-h.md
index 336deb7a7..d383cae85 100644
--- a/docs/source/contributor-guide/benchmark-results/tpc-h.md
+++ b/docs/source/contributor-guide/benchmark-results/tpc-h.md
@@ -25,21 +25,84 @@ and we encourage you to run these benchmarks in your own environments.
 
 The tracking issue for improving TPC-H performance is [#391](https://github.com/apache/datafusion-comet/issues/391).
 
-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_allqueries.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_allqueries.png)
 
 Here is a breakdown showing relative performance of Spark and Comet for each query.
 
-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_compare.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_compare.png)
 
 The following chart shows how much Comet currently accelerates each query from the benchmark in relative terms.
 
-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_rel.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_rel.png)
 
 The following chart shows how much Comet currently accelerates each query from the benchmark in absolute terms.
 
-![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_abs.png)
+![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_abs.png)
 
 The raw results of these benchmarks in JSON format is available here:
 
 - [Spark](0.5.0/spark-tpch.json)
 - [Comet](0.5.0/comet-tpch.json)
+
+# Scripts
+
+Here are the scripts that were used to generate these results.
+
+## Apache Spark 
+
+```shell
+#!/bin/bash
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.instances=1 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.executor.memory=16g \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=16g \
+    --conf spark.eventLog.enabled=true \
+    tpcbench.py \
+    --name spark \
+    --benchmark tpch \
+    --data /mnt/bigdata/tpch/sf100/ \
+    --queries ../../tpch/queries \
+    --output . \
+    --iterations 5
+
+```
+
+## Apache Spark + Comet
+
+```shell
+#!/bin/bash
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.instances=1 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.executor.memory=16g \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=16g \
+    --conf spark.comet.exec.replaceSortMergeJoin=true \
+    --conf spark.eventLog.enabled=true \
+    --jars $COMET_JAR \
+    --driver-class-path $COMET_JAR \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+    --conf spark.comet.enabled=true \
+    --conf spark.comet.exec.shuffle.enabled=true \
+    --conf spark.comet.exec.shuffle.mode=auto \
+    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
+    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    tpcbench.py \
+    --name comet \
+    --benchmark tpch \
+    --data /mnt/bigdata/tpch/sf100/ \
+    --queries ../../tpch/queries \
+    --output . \
+    --iterations 5
+```
\ No newline at end of file
diff --git a/docs/source/contributor-guide/benchmarking.md b/docs/source/contributor-guide/benchmarking.md
index 173d598ac..e2372b3d6 100644
--- a/docs/source/contributor-guide/benchmarking.md
+++ b/docs/source/contributor-guide/benchmarking.md
@@ -24,62 +24,6 @@ benchmarking documentation and scripts are available in the [DataFusion Benchmar
 
 We also have many micro benchmarks that can be run from an IDE located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark).
 
 
-Here are example commands for running the benchmarks against a Spark cluster. This command will need to be 
-adapted based on the Spark environment and location of data files.
-
-These commands are intended to be run from the `runners/datafusion-comet` directory in the `datafusion-benchmarks` 
-repository.
-
-## Running Benchmarks Against Apache Spark
-
-```shell
-$SPARK_HOME/bin/spark-submit \
-    --master $SPARK_MASTER \
-    --conf spark.driver.memory=8G \
-    --conf spark.executor.instances=1 \
-    --conf spark.executor.memory=32G \
-    --conf spark.executor.cores=8 \
-    --conf spark.cores.max=8 \
-    tpcbench.py \
-    --benchmark tpch \
-    --data /mnt/bigdata/tpch/sf100/ \
-    --queries ../../tpch/queries \
-    --iterations 3
-```
-
-## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled
-
-### TPC-H
-
-```shell
-$SPARK_HOME/bin/spark-submit \
-    --master $SPARK_MASTER \
-    --conf spark.driver.memory=8G \
-    --conf spark.executor.instances=1 \
-    --conf spark.executor.memory=16G \
-    --conf spark.executor.cores=8 \
-    --conf spark.cores.max=8 \
-    --conf spark.memory.offHeap.enabled=true \
-    --conf spark.memory.offHeap.size=16g \
-    --jars $COMET_JAR \
-    --conf spark.driver.extraClassPath=$COMET_JAR \
-    --conf spark.executor.extraClassPath=$COMET_JAR \
-    --conf spark.plugins=org.apache.spark.CometPlugin \
-    --conf spark.comet.cast.allowIncompatible=true \
-    --conf spark.comet.exec.replaceSortMergeJoin=true \
-    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
-    --conf spark.comet.exec.shuffle.enabled=true \
-    --conf spark.comet.exec.shuffle.mode=auto \
-    --conf spark.comet.exec.shuffle.enableFastEncoding=true \
-    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
-    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
-    tpcbench.py \
-    --benchmark tpch \
-    --data /mnt/bigdata/tpch/sf100/ \
-    --queries ../../tpch/queries \
-    --iterations 3
-```
-
 ### TPC-DS
 
 For TPC-DS, use `spark.comet.exec.replaceSortMergeJoin=false`.
diff --git a/docs/source/contributor-guide/debugging.md b/docs/source/contributor-guide/debugging.md
index 47d1f04c8..8a368cca2 100644
--- a/docs/source/contributor-guide/debugging.md
+++ b/docs/source/contributor-guide/debugging.md
@@ -130,7 +130,7 @@ Then build the Comet as [described](https://github.com/apache/arrow-datafusion-c
 Start Comet with `RUST_BACKTRACE=1`
 
 ```console
-RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.5.0-SNAPSHOT.jar --conf spark.plugins=org.apache.spark.CometPlugin --conf spark.comet.enabled=true --conf spark.comet.exec.enabled=true
+RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.6.0-SNAPSHOT.jar --conf spark.plugins=org.apache.spark.CometPlugin --conf spark.comet.enabled=true --conf spark.comet.exec.enabled=true
 ```
 
 Get the expanded exception details
diff --git a/docs/source/user-guide/installation.md b/docs/source/user-guide/installation.md
index 22d482e47..390c92638 100644
--- a/docs/source/user-guide/installation.md
+++ b/docs/source/user-guide/installation.md
@@ -74,7 +74,7 @@ See the [Comet Kubernetes Guide](kubernetes.md) guide.
 Make sure `SPARK_HOME` points to the same Spark version as Comet was built for.
 
 ```console
-export COMET_JAR=spark/target/comet-spark-spark3.4_2.12-0.5.0-SNAPSHOT.jar
+export COMET_JAR=spark/target/comet-spark-spark3.4_2.12-0.6.0-SNAPSHOT.jar
 
 $SPARK_HOME/bin/spark-shell \
     --jars $COMET_JAR \
@@ -130,7 +130,7 @@ explicitly contain Comet otherwise Spark may use a different class-loader for th
 components which will then fail at runtime. For example:
 
 ```
---driver-class-path spark/target/comet-spark-spark3.4_2.12-0.5.0-SNAPSHOT.jar
+--driver-class-path spark/target/comet-spark-spark3.4_2.12-0.6.0-SNAPSHOT.jar
 ```
 
 Some cluster managers may require additional configuration, see <https://spark.apache.org/docs/latest/cluster-overview.html>
diff --git a/fuzz-testing/pom.xml b/fuzz-testing/pom.xml
index 2184e54ee..0b45025c6 100644
--- a/fuzz-testing/pom.xml
+++ b/fuzz-testing/pom.xml
@@ -25,7 +25,7 @@ under the License.
     <parent>
         <groupId>org.apache.datafusion</groupId>
         <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-        <version>0.5.0-SNAPSHOT</version>
+        <version>0.6.0-SNAPSHOT</version>
         <relativePath>../pom.xml</relativePath>
     </parent>
 
diff --git a/native/Cargo.lock b/native/Cargo.lock
index 7b00b7bc4..918f94a25 100644
--- a/native/Cargo.lock
+++ b/native/Cargo.lock
@@ -878,7 +878,7 @@ dependencies = [
 
 [[package]]
 name = "datafusion-comet"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "arrow",
  "arrow-array",
@@ -929,7 +929,7 @@ dependencies = [
 
 [[package]]
 name = "datafusion-comet-proto"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "prost 0.12.6",
  "prost-build",
@@ -937,7 +937,7 @@ dependencies = [
 
 [[package]]
 name = "datafusion-comet-spark-expr"
-version = "0.5.0"
+version = "0.6.0"
 dependencies = [
  "arrow",
  "arrow-array",
diff --git a/native/Cargo.toml b/native/Cargo.toml
index 72e2386bb..624d63ad2 100644
--- a/native/Cargo.toml
+++ b/native/Cargo.toml
@@ -20,7 +20,7 @@ members = ["core", "spark-expr", "proto"]
 resolver = "2"
 
 [workspace.package]
-version = "0.5.0"
+version = "0.6.0"
 homepage = "https://datafusion.apache.org/comet"
 repository = "https://github.com/apache/datafusion-comet"
 authors = ["Apache DataFusion <[email protected]>"]
@@ -48,8 +48,8 @@ datafusion-expr-common = { version = "44.0.0", default-features = false }
 datafusion-execution = { version = "44.0.0", default-features = false }
 datafusion-physical-plan = { version = "44.0.0", default-features = false }
 datafusion-physical-expr = { version = "44.0.0", default-features = false }
-datafusion-comet-spark-expr = { path = "spark-expr", version = "0.5.0" }
-datafusion-comet-proto = { path = "proto", version = "0.5.0" }
+datafusion-comet-spark-expr = { path = "spark-expr", version = "0.6.0" }
+datafusion-comet-proto = { path = "proto", version = "0.6.0" }
 chrono = { version = "0.4", default-features = false, features = ["clock"] }
 chrono-tz = { version = "0.8" }
 futures = "0.3.28"
diff --git a/pom.xml b/pom.xml
index 76e2288cc..4559d6741 100644
--- a/pom.xml
+++ b/pom.xml
@@ -30,7 +30,7 @@ under the License.
   </parent>
   <groupId>org.apache.datafusion</groupId>
   <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-  <version>0.5.0-SNAPSHOT</version>
+  <version>0.6.0-SNAPSHOT</version>
   <packaging>pom</packaging>
   <name>Comet Project Parent POM</name>
 
diff --git a/spark-integration/pom.xml b/spark-integration/pom.xml
index 84c09c1c9..24b1f7a00 100644
--- a/spark-integration/pom.xml
+++ b/spark-integration/pom.xml
@@ -26,7 +26,7 @@ under the License.
     <parent>
         <groupId>org.apache.datafusion</groupId>
         <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-        <version>0.5.0-SNAPSHOT</version>
+        <version>0.6.0-SNAPSHOT</version>
         <relativePath>../pom.xml</relativePath>
     </parent>
 
diff --git a/spark/pom.xml b/spark/pom.xml
index ad7590dbc..f15b0b2e8 100644
--- a/spark/pom.xml
+++ b/spark/pom.xml
@@ -26,7 +26,7 @@ under the License.
   <parent>
     <groupId>org.apache.datafusion</groupId>
     <artifactId>comet-parent-spark${spark.version.short}_${scala.binary.version}</artifactId>
-    <version>0.5.0-SNAPSHOT</version>
+    <version>0.6.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 
diff --git a/spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala b/spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
index c9d8ce55b..addf73706 100644
--- a/spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
+++ b/spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala
@@ -190,7 +190,7 @@ class CometSparkSessionExtensions
 
           // data source V1
           case scanExec @ FileSourceScanExec(
-                HadoopFsRelation(_, partitionSchema, _, _, _: ParquetFileFormat, _),
+                HadoopFsRelation(_, partitionSchema, _, _, fileFormat, _),
                 _: Seq[_],
                 requiredSchema,
                 _,
@@ -199,7 +199,8 @@ class CometSparkSessionExtensions
                 _,
                 _,
                 _)
-              if CometNativeScanExec.isSchemaSupported(requiredSchema)
+              if CometScanExec.isFileFormatSupported(fileFormat)
+                && CometNativeScanExec.isSchemaSupported(requiredSchema)
                 && CometNativeScanExec.isSchemaSupported(partitionSchema)
                 // TODO we only enable full native scan if COMET_EXEC_ENABLED is enabled
                 // but this is not really what we want .. we currently insert `CometScanExec`
@@ -1072,12 +1073,20 @@ class CometSparkSessionExtensions
         var firstNativeOp = true
         newPlan.transformDown {
           case op: CometNativeExec =>
-            if (firstNativeOp) {
+            val newPlan = if (firstNativeOp) {
               firstNativeOp = false
               op.convertBlock()
             } else {
               op
             }
+
+            // If reaching leaf node, reset `firstNativeOp` to true
+            // because it will start a new block in next iteration.
+            if (op.children.isEmpty) {
+              firstNativeOp = true
+            }
+
+            newPlan
           case op =>
             firstNativeOp = true
             op
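
The last hunk above resets `firstNativeOp` when a native leaf is reached, so that the next operator visited in the top-down traversal is treated as the start of a new block. The Java sketch below mimics that flag logic on a toy tree to show which nodes would be chosen as block starts. It is an illustration under stated assumptions, not Comet's implementation: `Node`, `BlockStartSketch`, and `findBlockStarts` are hypothetical names, and recording a node's name stands in for calling `convertBlock()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockStartSketch {

    /** Hypothetical stand-in for a physical plan node. */
    static final class Node {
        final String name;
        final boolean isNative;
        final List<Node> children;

        Node(String name, boolean isNative, Node... children) {
            this.name = name;
            this.isNative = isNative;
            this.children = Arrays.asList(children);
        }
    }

    private boolean firstNativeOp = true;
    private final List<String> blockStarts = new ArrayList<>();

    /** Pre-order (top-down) visit, mirroring the flag handling in the hunk. */
    private void visit(Node op) {
        if (op.isNative) {
            if (firstNativeOp) {
                firstNativeOp = false;
                blockStarts.add(op.name); // stands in for op.convertBlock()
            }
            // A native leaf ends the current block, so the next node visited
            // starts a new one.
            if (op.children.isEmpty()) {
                firstNativeOp = true;
            }
        } else {
            firstNativeOp = true;
        }
        for (Node child : op.children) {
            visit(child);
        }
    }

    /** Returns the names of the nodes treated as the first op of a block. */
    static List<String> findBlockStarts(Node root) {
        BlockStartSketch sketch = new BlockStartSketch();
        sketch.visit(root);
        return sketch.blockStarts;
    }

    public static void main(String[] args) {
        // P is non-native; X -> Y is one native chain; Z is a second native leaf.
        Node plan = new Node("P", false,
            new Node("X", true, new Node("Y", true)),
            new Node("Z", true));
        System.out.println(findBlockStarts(plan)); // [X, Z]
    }
}
```

In this toy plan the traversal visits `P`, `X`, `Y`, `Z`. Without the leaf reset, the flag would still be `false` when `Z` is reached and only `X` would be recorded.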


