mbutrovich commented on code in PR #171:
URL: https://github.com/apache/datafusion-site/pull/171#discussion_r3204010182
##########
content/blog/2026-05-07-datafusion-comet-0.16.0.md:
##########
@@ -0,0 +1,213 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.16.0 Release
+date: 2026-05-07
+author: pmc
+categories: [subprojects]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+The Apache DataFusion PMC is pleased to announce version 0.16.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
+
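+As a quick illustration of "no code changes", enabling Comet is purely a configuration step. The sketch below
+shows one way to do it from Scala; the configuration keys follow the Comet user guide, which remains the
+authoritative reference for the version you deploy.
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+// Minimal sketch of enabling Comet on an existing Spark application.
+// Configuration keys are illustrative; check the Comet user guide for
+// the release you deploy.
+val spark = SparkSession.builder()
+  .appName("comet-example")
+  // Load the Comet plugin so Spark physical plans are translated to DataFusion.
+  .config("spark.plugins", "org.apache.comet.CometPlugin")
+  // Use Comet's shuffle manager for native shuffle.
+  .config("spark.shuffle.manager", "org.apache.spark.shuffle.comet.CometShuffleManager")
+  // Comet's native operators allocate from off-heap memory.
+  .config("spark.memory.offHeap.enabled", "true")
+  .config("spark.memory.offHeap.size", "4g")
+  .getOrCreate()
+
+// Queries run unchanged; accelerated operators show up as Comet* nodes in the plan.
+spark.sql("SELECT 1 + 1").explain()
+```
+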
+This release covers approximately three weeks of development work and is the result of merging 115 PRs from 17
+contributors. See the [change log] for more information.
+
+[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.16.0.md
+
+## Expanded Spark 4 Support
+
+Spark 4 is a major theme of this release. Comet now ships first-class support for both Spark 4.0.2 and
+Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for each.
+
+- **Spark 4.1.1**: New `spark-4.1` Maven profile and shim sources, with Comet's PR test matrix and Spark SQL
+  test suites enabled against Spark 4.1.1. The default Maven profile has been updated to Spark 4.1 / Scala 2.13
+  to reflect that this is now the primary development target.
+- **Shared 4.x shims**: Identical pieces of the Spark 4.0 and 4.1 shims have been consolidated into a shared
+  `spark-4.x` source tree, reducing duplication as more 4.x minor versions land.
+- **Spark 4.0 / JDK 21**: Added a Spark 4.0 / JDK 21 CI profile to validate Comet on the JDK most users are
+  expected to deploy with Spark 4.
+
+### Adapting to Spark 4 Behavior Changes
+
+Spark 4 introduced a number of type, planner, and on-disk format changes relative to Spark 3.x. Several
+correctness fixes this cycle bring Comet's behavior in line with these changes:
+
+- **`Variant` type (new in Spark 4.0)**: Spark 4.0 added a new `Variant` data type for semi-structured
+  data. Comet does not yet read the shredded Variant on-disk format natively, and delegates these scans
+  to Spark.
+- **String collation (new in Spark 4.0)**: Spark 4.0 added collation support for `StringType`. Comet's
+  native operators do not yet implement non-default collations, so hash join and sort-merge join reject
+  collated string join keys, and shuffle, sort, and aggregate fall back to Spark when keys carry a
+  non-default collation.
+- **Wider `TimestampNTZType` usage**: Spark 4 uses `TimestampNTZType` (timestamp without time zone) in
+  more places than 3.x — for example, in expression return types and as the inferred type for some
+  literal forms. Comet adds support this cycle for cast to and from `timestamp_ntz`, cast from string to
+  `timestamp_ntz`, and `unix_timestamp` over `TimestampNTZType` inputs; see the sketch after this list.
+- **DSv2 scalar subquery pushdown ([SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402))**:
+  Spark extended scalar subquery pushdown and reuse to V2 file scans. `CometNativeScanExec` now
+  participates in this pushdown so DSv2 plans benefit from the same subquery reuse as Spark's own scan.
+- **`to_json` and `array_compact` (Spark 4.0)**: Spark 4.0 adjusted output formatting and return-type
+  metadata for these expressions; Comet now matches the new behavior.
+- **BloomFilter V2 (new in Spark 4.1)**: Spark 4.1 introduced a new BloomFilter binary format with
+  different bit-scattering. Comet now reads this format so that runtime filters produced by Spark 4.1
+  remain usable in native execution.
+- **Spark 4.1.1 analyzer refinements**: Spark 4.1.1 changed how struct projections handle the case where
+  every requested child field is missing from the Parquet file, and how `allowDecimalPrecisionLoss`
+  flows through the `DecimalPrecision` rule. Comet now preserves parent-struct nullness in the first
+  case and the stored `allowDecimalPrecisionLoss` flag in the second.
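+
+As a concrete (hypothetical) sketch of the newly covered `timestamp_ntz` paths, assuming Comet is enabled as
+shown earlier, expressions like these now stay on the native path under Spark 4 rather than falling back:
+
+```scala
+import org.apache.spark.sql.functions.{col, unix_timestamp}
+
+// Sketch only: assumes a SparkSession `spark` with Comet enabled.
+import spark.implicits._
+
+val df = Seq("2026-05-07 12:34:56").toDF("s")
+  // Cast from string to timestamp_ntz (newly supported in Comet 0.16.0).
+  .withColumn("ts_ntz", col("s").cast("timestamp_ntz"))
+
+df.select(
+  // Cast from timestamp_ntz back to string.
+  $"ts_ntz".cast("string").as("s_again"),
+  // unix_timestamp over a TimestampNTZType input.
+  unix_timestamp($"ts_ntz").as("epoch_seconds")
+).show()
+```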
+
+Most of these behavior differences were caught because **Comet runs the full Apache Spark SQL test suite
+against each supported Spark version** — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as part of CI. Running Spark's
+own correctness tests through Comet's native execution path is what surfaces semantic shifts like
+`TimestampNTZType` propagation, ANSI-driven cast and overflow changes, BloomFilter V2 encoding, and the
+4.1.1 analyzer rule changes, often before they show up in user workloads. As more Spark 4.x minor releases
+land, this same harness is what gives us confidence that Comet keeps up.
+
+### ANSI SQL Semantics
+
+Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how arithmetic overflow, invalid casts,
+division by zero, and similar error conditions are handled, and Spark itself now treats this as the standard
+configuration rather than an opt-in.
+
+This is a critical area for any Spark accelerator: an engine that falls back to vanilla Spark whenever ANSI is
+enabled effectively does not run on Spark 4 by default. **Comet implements ANSI semantics for the expressions
+it supports natively**, including arithmetic overflow checks, ANSI cast behavior, and `try_*` variants.
+Queries running with `spark.sql.ansi.enabled=true` continue to be accelerated rather than falling back.
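+
+For example, with ANSI mode on (the Spark 4 default), an overflowing add raises an error while the `try_*`
+variant returns `NULL`. The sketch below assumes a SparkSession `spark` with Comet enabled; both behaviors
+execute natively:
+
+```scala
+// spark.sql.ansi.enabled defaults to true on Spark 4; set it explicitly here
+// for clarity.
+spark.conf.set("spark.sql.ansi.enabled", "true")
+
+// Under ANSI semantics this raises an arithmetic overflow error instead of
+// silently wrapping around:
+//   spark.sql("SELECT 9223372036854775807 + 1").show()
+
+// The try_* variant returns NULL on overflow, and Comet keeps it on the
+// native path rather than falling back to Spark.
+spark.sql("SELECT try_add(9223372036854775807, 1)").show()
+```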
+
+See the [Comet Compatibility Guide] for details on which expressions have full ANSI coverage, and the
+[ANSI mode tracking issue](https://github.com/apache/datafusion-comet/issues/313) for ongoing work to extend
+coverage further.
+
+[Comet Compatibility Guide]: https://datafusion.apache.org/comet/user-guide/latest/compatibility/index.html
+
+## Dynamic Partition Pruning for Native Parquet Scans
+
+Comet 0.15.0 introduced Dynamic Partition Pruning (DPP) for the native Iceberg reader. Comet 0.16.0 extends
+DPP support across the rest of Comet's native scan paths, so the Parquet scans that most workloads rely on
+now benefit from DPP as well.
+
+- **Non-AQE DPP for native Parquet scans** ([#4011](https://github.com/apache/datafusion-comet/pull/4011)):
+  DPP filters derived from broadcast subqueries are now honored by Comet's native Parquet scan, with broadcast
+  exchanges reused across the DPP subquery and the join.
+- **AQE DPP for native Parquet scans** ([#4112](https://github.com/apache/datafusion-comet/pull/4112)):
+  When Adaptive Query Execution is enabled, broadcast reuse for DPP subqueries is wired through the AQE
+  re-planning path so DPP pruning fires after AQE rewrites the plan.
+- **AQE DPP broadcast reuse for native Iceberg scans** ([#4215](https://github.com/apache/datafusion-comet/pull/4215)):
+  Brings the same AQE-aware broadcast reuse to the native Iceberg reader, completing DPP coverage across
+  Comet's native scan implementations.
+
+For star-schema-style workloads, DPP can substantially reduce I/O by pruning fact-table partitions based on
+filters applied to dimension tables at runtime. With this release, those benefits apply uniformly whether the
+underlying table is Parquet or Iceberg, and whether or not AQE is enabled.
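+
+As an illustrative sketch (with a hypothetical `sales` fact table partitioned by `date_id` and a small
+`dates` dimension table), a query like the following lets Spark derive a runtime partition filter on the fact
+table from the broadcast dimension side, and with 0.16.0 the pruned scan stays native:
+
+```scala
+// Hypothetical star-schema query; table names are for illustration only.
+// DPP turns the dimension filter into a runtime partition filter on `sales`,
+// so only the matching date_id partitions are read.
+val result = spark.sql("""
+  SELECT d.d_year, SUM(s.amount) AS total
+  FROM sales s
+  JOIN dates d ON s.date_id = d.date_id
+  WHERE d.d_year = 2025
+  GROUP BY d.d_year
+""")
+
+// The executed plan shows the pruning as a dynamicpruningexpression(...)
+// entry in the scan's partition filters.
+result.explain()
+```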
+
+A number of previously skipped DPP tests have been re-enabled this cycle as the implementation matured,
+including non-AQE DPP test cases and the `DynamicPartitionPruning` static-scan-metrics test.
Review Comment:
```suggestion
## Expanded Adaptive Execution Support

Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic Partition Pruning (DPP) prunes
fact-table partitions based on broadcast dimension filters, and `ReuseExchange` and `ReuseSubquery` ensure
that a broadcast or subquery referenced in multiple places executes only once. For star-schema workloads,
these mechanisms are not optional. They are often the difference between a query that reads 1% of the fact
table and one that reads all of it.

Prior to 0.16.0, Comet's native scans only partially participated in this machinery. `CometNativeScanExec`
(the DataFusion-based native Parquet scan) fell back to Spark entirely whenever a DPP filter was present.
`CometIcebergNativeScanExec` supported non-AQE DPP as of 0.15.0
([#3349](https://github.com/apache/datafusion-comet/pull/3349)), but without broadcast exchange reuse, so the
DPP subquery re-executed the dimension broadcast.

Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg scans on a single DPP and
subquery-resolution path:

- **Non-AQE DPP for native Parquet, with broadcast exchange reuse**
  ([#4011](https://github.com/apache/datafusion-comet/pull/4011),
  [#4037](https://github.com/apache/datafusion-comet/pull/4037)): A new `CometSubqueryBroadcastExec` replaces
  Spark's `SubqueryBroadcastExec` in DPP expressions and wraps a `CometBroadcastExchangeExec`, so
  `ReuseExchangeAndSubquery` matches the join side and the DPP subquery and broadcasts the dimension exactly
  once.
- **AQE DPP for native Parquet** ([#4112](https://github.com/apache/datafusion-comet/pull/4112)): Under AQE,
  Spark's `PlanAdaptiveDynamicPruningFilters` cannot match Comet's broadcast hash join and would otherwise
  rewrite DPP to `TrueLiteral`, disabling pruning. 0.16.0 intercepts `SubqueryAdaptiveBroadcastExec` before
  Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule that searches both the current
  stage and the root plan for a reusable broadcast. DPP subqueries are registered in AQE's shared
  `subqueryCache` so cross-plan DPP (for example, a main query and a scalar subquery referencing the same
  dimension) deduplicates correctly. A narrower tagging-based fallback covers Spark 3.4, which lacks the
  `injectQueryStageOptimizerRule` extension point.
- **AQE DPP broadcast reuse for native Iceberg** ([#4215](https://github.com/apache/datafusion-comet/pull/4215)):
  Lifts `runtimeFilters` to a top-level constructor field on `CometIcebergNativeScanExec` (mirroring
  `BatchScanExec`), so Spark's expression-rewrite passes can see and convert the DPP subquery. The same
  `CometSubqueryBroadcastExec` machinery from the Parquet path now handles the Iceberg case.
- **Scalar subquery pushdown and AQE subquery reuse**
  ([#4053](https://github.com/apache/datafusion-comet/pull/4053),
  [SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402)): `CometNativeScanExec` now participates in
  scalar subquery pushdown into Parquet data filters, and in AQE-time subquery deduplication via a new
  `CometReuseSubquery` rule that re-applies Spark's `ReuseAdaptiveSubquery` algorithm after Comet's node
  replacements.

**Measured impact on TPC-DS:** 78 queries previously fell back to Spark whenever DPP filters were planned,
running 30–50% natively. With native DPP in 0.16.0, the same queries run 80–97% natively. Representative
examples:

| Query | Before | After |
|-------|--------|-------|
| q1    | 36%    | 96%   |
| q4    | 31%    | 95%   |
| q31   | 31%    | 95%   |
| q74   | 32%    | 95%   |
| q92   | 36%    | 95%   |

Several Spark SQL DPP tests that Comet previously skipped are re-enabled to guarantee Spark compatibility and
prevent regressions.
```
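
One quick way to sanity-check this section's claims locally (an illustrative sketch using plain plan-string
matching, not any Comet API): run a DPP-eligible join with Comet and AQE enabled, then confirm the final plan
keeps both the native scan and the pruning expression.

```scala
// Sketch: `df` is assumed to be a DataFrame for a DPP-eligible
// fact/dimension join, with Comet enabled and AQE on.
df.collect() // force execution so AQE finalizes the plan
val plan = df.queryExecution.executedPlan.toString
val nativeDpp = plan.contains("CometNativeScan") &&
  plan.contains("dynamicpruningexpression")
println(s"DPP stayed on the native scan: $nativeDpp")
```
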
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]