kevinjqliu commented on code in PR #171:
URL: https://github.com/apache/datafusion-site/pull/171#discussion_r3213683674


##########
content/blog/2026-05-07-datafusion-comet-0.16.0.md:
##########
@@ -0,0 +1,247 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.16.0 Release
+date: 2026-05-07
+author: pmc
+categories: [subprojects]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+The Apache DataFusion PMC is pleased to announce version 0.16.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
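As a minimal sketch of how Comet is typically enabled from PySpark (the configuration keys follow the Comet installation guide at the time of writing; check the guide for your version, as names can change):

```python
# Illustrative only: requires a Spark installation with the Comet plugin
# jar on the classpath, so this is not meant to run standalone.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)
```

Once the plugin is active, Comet rewrites eligible physical operators transparently; no query changes are needed.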
+
+This release covers approximately three weeks of development work and is the result of merging 115 PRs from 17
+contributors. See the [change log] for more information.
+
+[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.16.0.md
+
+## Expanded Spark 4 Support
+
+Spark 4 is a major theme of this release. Comet now ships first-class support for both Spark 4.0.2 and
+Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for each.
+
+- **Spark 4.1.1**: New `spark-4.1` Maven profile and shim sources, with Comet's PR test matrix and Spark SQL
+  test suites enabled against Spark 4.1.1. The default Maven profile has been updated to Spark 4.1 / Scala 2.13
+  to reflect that this is now the primary development target.
+- **Shared 4.x shims**: Identical pieces of the Spark 4.0 and 4.1 shims have been consolidated into a shared
+  `spark-4.x` source tree, reducing duplication as more 4.x minor versions land.
+- **Spark 4.0 / JDK 21**: Added a Spark 4.0 / JDK 21 CI profile to validate Comet on the JDK most users are
+  expected to deploy with Spark 4.
+  expected to deploy with Spark 4.
+
+### Adapting to Spark 4 Behavior Changes
+
+Spark 4 introduced a number of type, planner, and on-disk format changes relative to Spark 3.x. Several
+correctness fixes in this release bring Comet's behavior in line with these changes:
+
+- **`Variant` type (new in Spark 4.0)**: Spark 4.0 added a new `Variant` data type for semi-structured
+  data. Comet does not yet read the shredded Variant on-disk format natively, and delegates these scans
+  to Spark.
+- **String collation (new in Spark 4.0)**: Spark 4.0 added collation support for `StringType`. Comet's
+  native operators do not yet implement non-default collations, so hash join and sort-merge join reject
+  collated string join keys, and shuffle, sort, and aggregate fall back to Spark when keys carry a
+  non-default collation.
+- **Wider `TimestampNTZType` usage**: Spark 4 uses `TimestampNTZType` (timestamp without time zone) in
+  more places than 3.x — for example, in expression return types and as the inferred type for some
+  literal forms. Comet adds support this cycle for cast to and from `timestamp_ntz`, cast from string to
+  `timestamp_ntz`, and `unix_timestamp` over `TimestampNTZType` inputs.
+- **`to_json` and `array_compact` (Spark 4.0)**: Spark 4.0 adjusted output formatting and return-type
+  metadata for these expressions; Comet now matches the new behavior.
+- **BloomFilter V2 (new in Spark 4.1)**: Spark 4.1 introduced a new BloomFilter binary format with
+  different bit-scattering. Comet now reads this format so that runtime filters produced by Spark 4.1
+  remain usable in native execution.
+- **Spark 4.1.1 analyzer refinements**: Spark 4.1.1 changed how struct projections handle the case where
+  every requested child field is missing from the Parquet file, and how `allowDecimalPrecisionLoss`
+  flows through the `DecimalPrecision` rule. Comet now preserves parent-struct nullness in the first
+  case and the stored `allowDecimalPrecisionLoss` flag in the second.
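To make the `TimestampNTZType` distinction concrete, here is a small Python sketch (illustrative only, not Comet's implementation): a `timestamp_ntz` value is a wall-clock reading with no zone attached, so casting a string to it parses the literal digits without anchoring them to UTC or a session zone.

```python
from datetime import datetime, timezone

def cast_string_to_timestamp_ntz(s: str) -> datetime:
    # timestamp_ntz carries no zone: keep the parsed value naive
    # (tzinfo stays None) rather than anchoring it to any zone.
    return datetime.fromisoformat(s)

def cast_timestamp_to_ntz(ts: datetime) -> datetime:
    # Dropping the zone keeps the wall-clock fields as-is. (Engines may
    # first convert through a session time zone; that step is omitted here.)
    return ts.replace(tzinfo=None)

ntz = cast_string_to_timestamp_ntz("2026-05-07 12:34:56")
assert ntz.tzinfo is None and ntz.hour == 12
```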
+
+Most of these behavior differences were caught because **Comet runs the full Apache Spark SQL test suite
+against each supported Spark version** — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as part of CI. Running Spark's
+own correctness tests through Comet's native execution path is what surfaces semantic shifts like
+`TimestampNTZType` propagation, ANSI-driven cast and overflow changes, BloomFilter V2 encoding, and the
+4.1.1 analyzer rule changes, often before they show up in user workloads. As more Spark 4.x minor releases
+land, this same harness is what gives us confidence that Comet keeps up.
+
+### ANSI SQL Semantics
+
+Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how arithmetic overflow, invalid casts,
+division by zero, and similar error conditions are handled, and Spark itself now treats this as the standard
+configuration rather than an opt-in.
+
+This is a critical area for any Spark accelerator: an engine that falls back to vanilla Spark whenever ANSI is
+enabled effectively does not run on Spark 4 by default. **Comet implements ANSI semantics for the expressions
+it supports natively**, including arithmetic overflow checks, ANSI cast behavior, and `try_*` variants.
+Queries running with `spark.sql.ansi.enabled=true` continue to be accelerated rather than falling back.
+
+See the [Comet Compatibility Guide] for details on which expressions have full ANSI coverage.
+
+[Comet Compatibility Guide]: https://datafusion.apache.org/comet/user-guide/latest/compatibility/index.html
+
+## Expanded Adaptive Execution Support
+
+Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic Partition Pruning (DPP) prunes
+fact-table partitions based on broadcast dimension filters, and `ReuseExchange` and `ReuseSubquery` ensure
+that a broadcast or subquery referenced in multiple places executes only once. For star-schema workloads,
+these mechanisms are not optional. They are often the difference between a query that reads 1% of the fact
+table and one that reads all of it.
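Conceptually, DPP looks something like this toy Python sketch (hypothetical data; the real mechanism operates on physical plans and broadcast exchanges): the dimension filter runs first, and only fact partitions whose keys survive it are ever scanned.

```python
# Fact table stored as one bucket of rows per partition key.
fact_partitions = {
    1: [("item-a", 10)],
    2: [("item-b", 20)],
    3: [("item-c", 30)],
}
# Dimension table: (partition key, year).
dim = [(1, 2026), (2, 2025), (3, 2026)]

# Broadcast side: keys that survive the dimension filter (year = 2026).
surviving_keys = {k for k, year in dim if year == 2026}

# Only partitions 1 and 3 are read; partition 2 is pruned before the scan.
scanned = {k: rows for k, rows in fact_partitions.items() if k in surviving_keys}
assert set(scanned) == {1, 3}
```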
+
+Prior to 0.16.0, Comet's native scans only partially participated in this machinery. `CometNativeScanExec`
+(the DataFusion-based native Parquet scan) fell back to Spark entirely whenever a DPP filter was present.
+`CometIcebergNativeScanExec` supported non-AQE DPP as of 0.15.0
+([#3349](https://github.com/apache/datafusion-comet/pull/3349)), but without broadcast exchange reuse, so
+the DPP subquery re-executed the dimension broadcast.
+
+Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg scans on a single DPP and
+subquery-resolution path:
+
+- **Non-AQE DPP for native Parquet, with broadcast exchange reuse**
+  ([#4011](https://github.com/apache/datafusion-comet/pull/4011),
+  [#4037](https://github.com/apache/datafusion-comet/pull/4037)): A new `CometSubqueryBroadcastExec` replaces
+  Spark's `SubqueryBroadcastExec` in DPP expressions and wraps a `CometBroadcastExchangeExec`, so
+  `ReuseExchangeAndSubquery` matches the join side and the DPP subquery and broadcasts the dimension exactly
+  once.
+- **AQE DPP for native Parquet** ([#4112](https://github.com/apache/datafusion-comet/pull/4112)): Under AQE,
+  Spark's `PlanAdaptiveDynamicPruningFilters` cannot match Comet's broadcast hash join and would otherwise
+  rewrite DPP to `TrueLiteral`, disabling pruning. 0.16.0 intercepts `SubqueryAdaptiveBroadcastExec` before
+  Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule that searches both the current
+  stage and the root plan for a reusable broadcast. DPP subqueries are registered in AQE's shared
+  `subqueryCache` so cross-plan DPP (for example, a main query and a scalar subquery referencing the same
+  dimension) deduplicates correctly. A narrower tagging-based fallback covers Spark 3.4, which lacks the
+  `injectQueryStageOptimizerRule` extension point.
+- **AQE DPP broadcast reuse for native Iceberg**
+  ([#4215](https://github.com/apache/datafusion-comet/pull/4215)): Lifts `runtimeFilters` to a top-level
+  constructor field on `CometIcebergNativeScanExec` (mirroring `BatchScanExec`), so Spark's
+  expression-rewrite passes can see and convert the DPP subquery. The same `CometSubqueryBroadcastExec`
+  machinery from the Parquet path now handles the Iceberg case.
+- **Scalar subquery pushdown and AQE subquery reuse**
+  ([#4053](https://github.com/apache/datafusion-comet/pull/4053),
+  [SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402)): `CometNativeScanExec` now participates
+  in scalar subquery pushdown into Parquet data filters, and in AQE-time subquery deduplication via a new
+  `CometReuseSubquery` rule that re-applies Spark's `ReuseAdaptiveSubquery` algorithm after Comet's node
+  replacements.
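What exchange and subquery reuse buys can be modeled with a small cache keyed by a canonicalized plan (a toy sketch, not Comet's actual rule): the first reference executes the broadcast, and every later reference, whether from the join side or a DPP subquery, returns the cached result.

```python
class BroadcastReuseCache:
    """Toy model: at most one execution per canonicalized subplan."""

    def __init__(self):
        self.cache = {}
        self.executions = 0

    def broadcast(self, canonical_plan: str):
        if canonical_plan not in self.cache:
            # First reference: actually execute the broadcast.
            self.executions += 1
            self.cache[canonical_plan] = f"rows-of({canonical_plan})"
        # Later references (join side, DPP subquery) reuse the result.
        return self.cache[canonical_plan]

reuse = BroadcastReuseCache()
reuse.broadcast("broadcast(filter(dim_store, year=2026))")  # join side
reuse.broadcast("broadcast(filter(dim_store, year=2026))")  # DPP subquery
assert reuse.executions == 1
```

Before 0.16.0, Comet's node replacements could make the two references stop matching, which is the situation the new reuse rules fix.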
+
+**Measured impact on TPC-DS:** 78 queries previously fell back to Spark wherever DPP filters were planned,
+running only 30–50% of the plan natively. With native DPP in 0.16.0, the same queries run 80–97%
+natively. Representative examples:
+
+| Query | Before | After |
+|-------|--------|-------|
+| q1    | 36%    | 96%   |
+| q4    | 31%    | 95%   |
+| q31   | 31%    | 95%   |
+| q74   | 32%    | 95%   |
+| q92   | 36%    | 95%   |
+
+Several Spark SQL DPP tests that Comet previously skipped are re-enabled to guarantee Spark compatibility
+and prevent regressions.
+
+## Improved TPC-DS Benchmark Results
+
+TODO: side-by-side TPC-DS benchmark results comparing 0.15.0 and 0.16.0.

Review Comment:
   flagging the TODO here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

