mbutrovich commented on code in PR #171:
URL: https://github.com/apache/datafusion-site/pull/171#discussion_r3204010182
##########
content/blog/2026-05-07-datafusion-comet-0.16.0.md:
##########
@@ -0,0 +1,213 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.16.0 Release
+date: 2026-05-07
+author: pmc
+categories: [subprojects]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+The Apache DataFusion PMC is pleased to announce version 0.16.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
+
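+As a quick illustration of "no code changes", enabling Comet is purely a configuration step. The sketch below
+shows one way to do it from Scala; the configuration keys follow the Comet user guide, which remains the
+authoritative reference for the version you deploy.
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+// Minimal sketch of enabling Comet on an existing Spark application.
+// Configuration keys are illustrative; check the Comet user guide for
+// the release you deploy.
+val spark = SparkSession.builder()
+  .appName("comet-example")
+  // Load the Comet plugin so Spark physical plans are translated to DataFusion.
+  .config("spark.plugins", "org.apache.comet.CometPlugin")
+  // Use Comet's shuffle manager for native shuffle.
+  .config("spark.shuffle.manager", "org.apache.spark.shuffle.comet.CometShuffleManager")
+  // Comet's native operators allocate from off-heap memory.
+  .config("spark.memory.offHeap.enabled", "true")
+  .config("spark.memory.offHeap.size", "4g")
+  .getOrCreate()
+
+// Queries run unchanged; accelerated operators show up as Comet* nodes in the plan.
+spark.sql("SELECT 1 + 1").explain()
+```
+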
+This release covers approximately three weeks of development work and is the result of merging 115 PRs from 17
+contributors. See the [change log] for more information.
+
+[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.16.0.md
+
+## Expanded Spark 4 Support
+
+Spark 4 is a major theme of this release. Comet now ships first-class support for both Spark 4.0.2 and
+Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for each.
+
+- **Spark 4.1.1**: New `spark-4.1` Maven profile and shim sources, with Comet's PR test matrix and Spark SQL
+  test suites enabled against Spark 4.1.1. The default Maven profile has been updated to Spark 4.1 / Scala 2.13
+  to reflect that this is now the primary development target.
+- **Shared 4.x shims**: Identical pieces of the Spark 4.0 and 4.1 shims have been consolidated into a shared
+  `spark-4.x` source tree, reducing duplication as more 4.x minor versions land.
+- **Spark 4.0 / JDK 21**: Added a Spark 4.0 / JDK 21 CI profile to validate Comet on the JDK most users are
+  expected to deploy with Spark 4.
+
+### Adapting to Spark 4 Behavior Changes
+
+Spark 4 introduced a number of type, planner, and on-disk format changes relative to Spark 3.x. Several
+correctness fixes this cycle bring Comet's behavior in line with these changes:
+
+- **`Variant` type (new in Spark 4.0)**: Spark 4.0 added a new `Variant` data type for semi-structured
+  data. Comet does not yet read the shredded Variant on-disk format natively, and delegates these scans
+  to Spark.
+- **String collation (new in Spark 4.0)**: Spark 4.0 added collation support for `StringType`. Comet's
+  native operators do not yet implement non-default collations, so hash join and sort-merge join reject
+  collated string join keys, and shuffle, sort, and aggregate fall back to Spark when keys carry a
+  non-default collation.
+- **Wider `TimestampNTZType` usage**: Spark 4 uses `TimestampNTZType` (timestamp without time zone) in
+  more places than 3.x — for example, in expression return types and as the inferred type for some
+  literal forms. Comet adds support this cycle for cast to and from `timestamp_ntz`, cast from string to
+  `timestamp_ntz`, and `unix_timestamp` over `TimestampNTZType` inputs; see the sketch after this list.
+- **DSv2 scalar subquery pushdown ([SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402))**:
+  Spark extended scalar subquery pushdown and reuse to V2 file scans. `CometNativeScanExec` now
+  participates in this pushdown so DSv2 plans benefit from the same subquery reuse as Spark's own scan.
+- **`to_json` and `array_compact` (Spark 4.0)**: Spark 4.0 adjusted output formatting and return-type
+  metadata for these expressions; Comet now matches the new behavior.
+- **BloomFilter V2 (new in Spark 4.1)**: Spark 4.1 introduced a new BloomFilter binary format with
+  different bit-scattering. Comet now reads this format so that runtime filters produced by Spark 4.1
+  remain usable in native execution.
+- **Spark 4.1.1 analyzer refinements**: Spark 4.1.1 changed how struct projections handle the case where
+  every requested child field is missing from the Parquet file, and how `allowDecimalPrecisionLoss`
+  flows through the `DecimalPrecision` rule. Comet now preserves parent-struct nullness in the first
+  case and the stored `allowDecimalPrecisionLoss` flag in the second.
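+
+As a concrete (hypothetical) sketch of the newly covered `timestamp_ntz` paths, assuming Comet is enabled as
+shown earlier, expressions like these now stay on the native path under Spark 4 rather than falling back:
+
+```scala
+import org.apache.spark.sql.functions.{col, unix_timestamp}
+
+// Sketch only: assumes a SparkSession `spark` with Comet enabled.
+import spark.implicits._
+
+val df = Seq("2026-05-07 12:34:56").toDF("s")
+  // Cast from string to timestamp_ntz (newly supported in Comet 0.16.0).
+  .withColumn("ts_ntz", col("s").cast("timestamp_ntz"))
+
+df.select(
+  // Cast from timestamp_ntz back to string.
+  $"ts_ntz".cast("string").as("s_again"),
+  // unix_timestamp over a TimestampNTZType input.
+  unix_timestamp($"ts_ntz").as("epoch_seconds")
+).show()
+```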
+
+Most of these behavior differences were caught because **Comet runs the full Apache Spark SQL test suite
+against each supported Spark version** — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as part of CI. Running Spark's
+own correctness tests through Comet's native execution path is what surfaces semantic shifts like
+`TimestampNTZType` propagation, ANSI-driven cast and overflow changes, BloomFilter V2 encoding, and the
+4.1.1 analyzer rule changes, often before they show up in user workloads. As more Spark 4.x minor releases
+land, this same harness is what gives us confidence that Comet keeps up.
+
+### ANSI SQL Semantics
+
+Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how arithmetic overflow, invalid casts,
+division by zero, and similar error conditions are handled, and Spark itself now treats this as the standard
+configuration rather than an opt-in.
+
+This is a critical area for any Spark accelerator: an engine that falls back to vanilla Spark whenever ANSI is
+enabled effectively does not run on Spark 4 by default. **Comet implements ANSI semantics for the expressions
+it supports natively**, including arithmetic overflow checks, ANSI cast behavior, and `try_*` variants.
+Queries running with `spark.sql.ansi.enabled=true` continue to be accelerated rather than falling back.
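+
+For example, with ANSI mode on (the Spark 4 default), an overflowing add raises an error while the `try_*`
+variant returns `NULL`. The sketch below assumes a SparkSession `spark` with Comet enabled; both behaviors
+execute natively:
+
+```scala
+// spark.sql.ansi.enabled defaults to true on Spark 4; set it explicitly here
+// for clarity.
+spark.conf.set("spark.sql.ansi.enabled", "true")
+
+// Under ANSI semantics this raises an arithmetic overflow error instead of
+// silently wrapping around:
+//   spark.sql("SELECT 9223372036854775807 + 1").show()
+
+// The try_* variant returns NULL on overflow, and Comet keeps it on the
+// native path rather than falling back to Spark.
+spark.sql("SELECT try_add(9223372036854775807, 1)").show()
+```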
+
+See the [Comet Compatibility Guide] for details on which expressions have full ANSI coverage, and the
+[ANSI mode tracking issue](https://github.com/apache/datafusion-comet/issues/313) for ongoing work to extend
+coverage further.
+
+[Comet Compatibility Guide]: https://datafusion.apache.org/comet/user-guide/latest/compatibility/index.html
+
+## Dynamic Partition Pruning for Native Parquet Scans
+
+Comet 0.15.0 introduced Dynamic Partition Pruning (DPP) for the native Iceberg reader. Comet 0.16.0 extends
+DPP support across the rest of Comet's native scan paths, so the Parquet scans that most workloads rely on
+now benefit from DPP as well.
+
+- **Non-AQE DPP for native Parquet scans** ([#4011](https://github.com/apache/datafusion-comet/pull/4011)):
+  DPP filters derived from broadcast subqueries are now honored by Comet's native Parquet scan, with broadcast
+  exchanges reused across the DPP subquery and the join.
+- **AQE DPP for native Parquet scans** ([#4112](https://github.com/apache/datafusion-comet/pull/4112)):
+  When Adaptive Query Execution is enabled, broadcast reuse for DPP subqueries is wired through the AQE
+  re-planning path so DPP pruning fires after AQE rewrites the plan.
+- **AQE DPP broadcast reuse for native Iceberg scans** ([#4215](https://github.com/apache/datafusion-comet/pull/4215)):
+  Brings the same AQE-aware broadcast reuse to the native Iceberg reader, completing DPP coverage across
+  Comet's native scan implementations.
+
+For star-schema-style workloads, DPP can substantially reduce I/O by pruning fact-table partitions based on
+filters applied to dimension tables at runtime. With this release, those benefits apply uniformly whether the
+underlying table is Parquet or Iceberg, and whether or not AQE is enabled.
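+
+As an illustrative sketch (with a hypothetical `sales` fact table partitioned by `date_id` and a small
+`dates` dimension table), a query like the following lets Spark derive a runtime partition filter on the fact
+table from the broadcast dimension side, and with 0.16.0 the pruned scan stays native:
+
+```scala
+// Hypothetical star-schema query; table names are for illustration only.
+// DPP turns the dimension filter into a runtime partition filter on `sales`,
+// so only the matching date_id partitions are read.
+val result = spark.sql("""
+  SELECT d.d_year, SUM(s.amount) AS total
+  FROM sales s
+  JOIN dates d ON s.date_id = d.date_id
+  WHERE d.d_year = 2025
+  GROUP BY d.d_year
+""")
+
+// The executed plan shows the pruning as a dynamicpruningexpression(...)
+// entry in the scan's partition filters.
+result.explain()
+```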
+
+A number of previously skipped DPP tests have been re-enabled this cycle as the implementation matured,
+including non-AQE DPP test cases and the `DynamicPartitionPruning` static-scan-metrics test.
Review Comment:
```suggestion
## Expanded Adaptive Execution Support

Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic Partition Pruning (DPP) prunes
fact-table partitions based on broadcast dimension filters, and `ReuseExchange` and `ReuseSubquery` ensure
that a broadcast or subquery referenced in multiple places executes only once. For star-schema workloads,
these mechanisms are not optional. They are often the difference between a query that reads 1% of the fact
table and one that reads all of it.

Prior to 0.16.0, Comet's native scans only partially participated in this machinery. `CometNativeScanExec`
(the DataFusion-based native Parquet scan) fell back to Spark entirely whenever a DPP filter was present.
`CometIcebergNativeScanExec` supported non-AQE DPP as of 0.15.0
([#3349](https://github.com/apache/datafusion-comet/pull/3349)), but without broadcast exchange reuse, so the
DPP subquery re-executed the dimension broadcast.

Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg scans on a single DPP and
subquery-resolution path:

- **Non-AQE DPP for native Parquet, with broadcast exchange reuse**
  ([#4011](https://github.com/apache/datafusion-comet/pull/4011),
  [#4037](https://github.com/apache/datafusion-comet/pull/4037)): A new `CometSubqueryBroadcastExec` replaces
  Spark's `SubqueryBroadcastExec` in DPP expressions and wraps a `CometBroadcastExchangeExec`, so
  `ReuseExchangeAndSubquery` matches the join side and the DPP subquery and broadcasts the dimension exactly
  once.
- **AQE DPP for native Parquet** ([#4112](https://github.com/apache/datafusion-comet/pull/4112)): Under AQE,
  Spark's `PlanAdaptiveDynamicPruningFilters` cannot match Comet's broadcast hash join and would otherwise
  rewrite DPP to `TrueLiteral`, disabling pruning. 0.16.0 intercepts `SubqueryAdaptiveBroadcastExec` before
  Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule that searches both the current
  stage and the root plan for a reusable broadcast. DPP subqueries are registered in AQE's shared
  `subqueryCache` so cross-plan DPP (for example, a main query and a scalar subquery referencing the same
  dimension) deduplicates correctly. A narrower tagging-based fallback covers Spark 3.4, which lacks the
  `injectQueryStageOptimizerRule` extension point.
- **AQE DPP broadcast reuse for native Iceberg** ([#4215](https://github.com/apache/datafusion-comet/pull/4215)):
  Lifts `runtimeFilters` to a top-level constructor field on `CometIcebergNativeScanExec` (mirroring
  `BatchScanExec`), so Spark's expression-rewrite passes can see and convert the DPP subquery. The same
  `CometSubqueryBroadcastExec` machinery from the Parquet path now handles the Iceberg case.
- **Scalar subquery pushdown and AQE subquery reuse**
  ([#4053](https://github.com/apache/datafusion-comet/pull/4053),
  [SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402)): `CometNativeScanExec` now participates in
  scalar subquery pushdown into Parquet data filters, and in AQE-time subquery deduplication via a new
  `CometReuseSubquery` rule that re-applies Spark's `ReuseAdaptiveSubquery` algorithm after Comet's node
  replacements.

**Measured impact on TPC-DS:** 78 queries previously fell back to Spark whenever DPP filters were planned,
running 30–50% natively. With native DPP in 0.16.0, the same queries run 80–97% natively. Representative
examples:

| Query | Before | After |
|-------|--------|-------|
| q1    | 36%    | 96%   |
| q4    | 31%    | 95%   |
| q31   | 31%    | 95%   |
| q74   | 32%    | 95%   |
| q92   | 36%    | 95%   |

Several Spark SQL DPP tests that Comet previously skipped are re-enabled to guarantee Spark compatibility and
prevent regressions.
```
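
One quick way to sanity-check this section's claims locally (an illustrative sketch using plain plan-string
matching, not any Comet API): run a DPP-eligible join with Comet and AQE enabled, then confirm the final plan
keeps both the native scan and the pruning expression.

```scala
// Sketch: `df` is assumed to be a DataFrame for a DPP-eligible
// fact/dimension join, with Comet enabled and AQE on.
df.collect() // force execution so AQE finalizes the plan
val plan = df.queryExecution.executedPlan.toString
val nativeDpp = plan.contains("CometNativeScan") &&
  plan.contains("dynamicpruningexpression")
println(s"DPP stayed on the native scan: $nativeDpp")
```
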
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]