[I] Bug triage results: 2026-05-26 [datafusion-comet]

via GitHub Tue, 26 May 2026 11:10:13 -0700


andygrove opened a new issue, #4441:
URL: https://github.com/apache/datafusion-comet/issues/4441


   # Bug triage results: 2026-05-26
   
   Triage pass over the open `requires-triage` queue, per the project 
[Bug Triage Guide](https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/bug_triage.md).
   
   - Total issues processed: 22
   - Labels applied to: 20
   - Skipped: 2 (previous triage summaries)
   - `priority:critical`: 1
   - `priority:high`: 1 (preserved existing label set by reporter)
   - `priority:medium`: 7
   - `priority:low`: 11
   
   Labels have already been applied and `requires-triage` removed from each 
issue listed under "Triaged". A reviewer should spot-check the calls and close 
this issue when satisfied. To correct a label, edit the affected issue directly.
   
   ## Triaged
   
   ### priority:critical
   
   - GetStructField returns non-null for fields of a NULL struct (missing 
null-mask propagation) 
([#4432](https://github.com/apache/datafusion-comet/issues/4432))
     - Area labels: `area:expressions`
     - Rationale: Silent wrong results — `GetStructField` returns the child 
column verbatim without applying the parent struct's null mask, so 
`isnotnull(structCol.field)` returns `true` even when `structCol` is null. The 
guide's first decision-tree question ("Can this bug cause silent wrong 
results?") classifies this as `priority:critical`, and the reporter shows a 
concrete Delta checkpoint reproducer.
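
The null-mask issue described in #4432 can be illustrated with a minimal sketch. This is not Comet's code: it models a struct column as a plain validity list plus a child value list, and the function names `get_struct_field_buggy`, `get_struct_field_fixed`, and `is_not_null` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Optional


def get_struct_field_buggy(
    struct_valid: List[bool], child_values: List[Optional[int]]
) -> List[Optional[int]]:
    # Models the reported behavior: the child column is returned verbatim
    # and the parent struct's validity is ignored.
    return list(child_values)


def get_struct_field_fixed(
    struct_valid: List[bool], child_values: List[Optional[int]]
) -> List[Optional[int]]:
    # Models the expected behavior: a field of a NULL struct is NULL,
    # so the parent's validity is combined with the child's values.
    return [v if valid else None for valid, v in zip(struct_valid, child_values)]


def is_not_null(column: List[Optional[int]]) -> List[bool]:
    # Stand-in for isnotnull(structCol.field), evaluated per row.
    return [v is not None for v in column]


# Row 1 has a NULL struct whose child slot still holds a leftover value.
struct_valid = [True, False, True]
child_values = [10, 20, None]

buggy = get_struct_field_buggy(struct_valid, child_values)
fixed = get_struct_field_fixed(struct_valid, child_values)
```

In this sketch, `is_not_null(buggy)` reports row 1 as non-null even though its parent struct is NULL, while `is_not_null(fixed)` reports it as null, which is the silent wrong result the rationale refers to.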
   
   ### priority:high
   
   - Iceberg 1.11 support 
([#4381](https://github.com/apache/datafusion-comet/issues/4381))
     - Area labels: `area:Iceberg`
     - Rationale: Already labeled `priority:high` and `area:Iceberg` by the 
reporter; preserved as-is. Iceberg 1.11 is the first release with Spark 4.1 
support, so Comet has no Iceberg coverage on Spark 4.1 until this lands — a 
sizeable feature gap rather than a typical missing-feature `medium`.
   
   ### priority:medium
   
   - ToJson PartialEq<dyn Any> impl is inconsistent with PartialEq impl 
([#4430](https://github.com/apache/datafusion-comet/issues/4430))
     - Area labels: `area:expressions`
     - Rationale: Clear bug — the two equality impls compare different field 
sets, so two `ToJson` exprs differing only in `timezone` or 
`ignore_null_fields` compare equal through the `Arc<dyn PhysicalExpr>` path. 
Latent correctness risk if DataFusion's planner uses it for dedup/caching, but 
no demonstrated query producing wrong results; `priority:medium` per the 
"broken features that have workarounds" bucket, with an escalation note below.
   - [EPIC] Implement all Spark date/time expressions 
([#4418](https://github.com/apache/datafusion-comet/issues/4418))
     - Area labels: `area:expressions`
     - Rationale: Tracking EPIC for missing date/time expression coverage; per 
the guide, missing expression support is `priority:medium` ("functional bugs / 
broken features that have workarounds"), since unsupported expressions fall 
back to Spark.
   - CometHashAggregateExec doesn't participate in Spark's 
AQEPropagateEmptyRelation optimization 
([#4412](https://github.com/apache/datafusion-comet/issues/4412))
     - Area labels: `area:aggregation`
     - Rationale: The reporter notes "query results are correct under Comet". 
What is lost is an AQE short-circuit, so this is a functional/performance 
regression rather than a correctness bug. Fits `priority:medium` for broken 
behavior with a workaround.
   - Implement TimeType support: Infrastructure - shuffle 
([#4396](https://github.com/apache/datafusion-comet/issues/4396))
     - Area labels: `area:shuffle`
     - Rationale: Missing type support causing fallback to Spark for any 
shuffle on TimeType columns. Per the guide, missing feature support with a 
workaround (Spark fallback) is `priority:medium`.
   - Drop the per-batch Comet→Spark buffer copy in CometColumnarPythonInput 
([#4383](https://github.com/apache/datafusion-comet/issues/4383))
     - Area labels: none
     - Rationale: Performance enhancement — drops one of two buffer copies on 
the JVM→Python pyarrow UDF transport path. Per the guide, "performance 
regressions / broken features that have workarounds" is `priority:medium`.
   - Implement TimeType support: Infrastructure - sort, min/max 
([#4379](https://github.com/apache/datafusion-comet/issues/4379))
     - Area labels: `area:aggregation`
     - Rationale: Companion to #4396 — missing TimeType support in sort and 
min/max aggregates causes Spark fallback. Same `priority:medium` rationale.
   - CometIcebergNativeScanExec: propagate outputOrdering from originalPlan 
instead of hardcoding Nil 
([#4367](https://github.com/apache/datafusion-comet/issues/4367))
     - Area labels: `area:scan`
     - Rationale: Functional/perf gap — Comet inserts an unnecessary 
`CometSortExec` above Iceberg native scans for sort-merge joins because 
`outputOrdering` is hardcoded to `Nil`. Lost optimization with a clear fix 
path; `priority:medium`.
   
   ### priority:low
   
   - Revisit the case for native columnar-to-row? 
([#4440](https://github.com/apache/datafusion-comet/issues/4440))
     - Area labels: none
     - Rationale: Design/discussion question about whether the rationale for 
`CometNativeColumnarToRowExec` still holds (benchmarks within ~5% of JVM path, 
GC pressure benefit not measured). Not a bug; per the guide, "everything else" 
falls into `priority:low`.
   - change default Maven profile to Spark 4.0 
([#4434](https://github.com/apache/datafusion-comet/issues/4434))
     - Area labels: `area:ci`
     - Rationale: Build/configuration choice (which Spark profile defaults to) 
— tooling decision rather than a defect. `priority:low` per the guide's 
tooling/build bucket.
   - Refactor CometPlainVector by replacing boolean flags with an enum 
([#4433](https://github.com/apache/datafusion-comet/issues/4433))
     - Area labels: none
     - Rationale: Code-quality refactor follow-up; no functional change. 
Cosmetic per the guide → `priority:low`.
   - Reorganize user and contributor guide navigation for clearer information 
architecture ([#4421](https://github.com/apache/datafusion-comet/issues/4421))
     - Area labels: none
     - Rationale: Documentation reorganization; no functional impact. 
Cosmetic/tooling → `priority:low`.
   - Add a docs-review Claude skill to enforce style guide and documentation 
quality ([#4420](https://github.com/apache/datafusion-comet/issues/4420))
     - Area labels: none
     - Rationale: Tooling enhancement (a Claude skill for docs review); no 
runtime impact. `priority:low`.
   - Establish nomenclature style guide and audit operator names for clarity 
([#4419](https://github.com/apache/datafusion-comet/issues/4419))
     - Area labels: none
     - Rationale: Documentation/style guide work; cosmetic. `priority:low`.
   - Reduce Github Action Usage 
([#4406](https://github.com/apache/datafusion-comet/issues/4406))
     - Area labels: `area:ci`
     - Rationale: CI/tooling enhancement to reduce Action minutes. Per the 
guide, CI tooling → `priority:low`.
   - Implement tiered CI approach 
([#4389](https://github.com/apache/datafusion-comet/issues/4389))
     - Area labels: `area:ci`
     - Rationale: CI/tooling enhancement (tiered pipeline). `priority:low`.
   - Add randomised fuzz harness for pyarrow UDF vector-copy path 
([#4384](https://github.com/apache/datafusion-comet/issues/4384))
     - Area labels: none
     - Rationale: Test infrastructure enhancement; no runtime impact. The guide 
explicitly lists "test-only failures, tooling" as `priority:low`.
   - Upgrade Spark 4.1.1 to 4.1.2 
([#4380](https://github.com/apache/datafusion-comet/issues/4380))
     - Area labels: none
     - Rationale: Routine dependency upgrade required for 0.17.0 per project 
versioning policy; tooling/maintenance → `priority:low`.
   - Investigate Spark SQL test suites that load Comet without registering 
CometShuffleManager 
([#4377](https://github.com/apache/datafusion-comet/issues/4377))
     - Area labels: `area:shuffle`, `spark sql tests`
     - Rationale: Test investigation — surfaces an asymmetry between how 
`CometSparkSessionExtensions` and `CometShuffleManager` are wired across Spark 
SQL suites. Test-only / tooling concern; `priority:low`.
   
   ## Escalations to consider
   
   - ToJson PartialEq<dyn Any> impl is inconsistent with PartialEq impl 
([#4430](https://github.com/apache/datafusion-comet/issues/4430))
     - The guide's first principle is "correctness over crashes" and the bug 
pattern (inconsistent equality used by the trait-object path) could in 
principle produce silently wrong query results if DataFusion's planner 
deduplicates two `ToJson` exprs that differ only in `timezone` or 
`ignore_null_fields`. Filed at `priority:medium` because the user-visible 
impact is not demonstrated. If a reviewer can confirm the planner-cache path 
actually fires on this difference, escalate to `priority:high` or 
`priority:critical`.
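
The escalation concern can be sketched as follows. This is a Python analogy of the Rust equality issue, not the actual `ToJson` implementation: the class `ToJsonSketch`, its methods `eq_full` and `eq_partial`, and the `dedup` helper are all hypothetical, and the assumption that the trait-object path compares only the child expression is made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToJsonSketch:
    child: str
    timezone: str
    ignore_null_fields: bool

    def eq_full(self, other: "ToJsonSketch") -> bool:
        # Models an equality impl that compares every field.
        return (
            self.child == other.child
            and self.timezone == other.timezone
            and self.ignore_null_fields == other.ignore_null_fields
        )

    def eq_partial(self, other: "ToJsonSketch") -> bool:
        # Models an inconsistent impl that skips timezone and
        # ignore_null_fields (assumed here to compare only the child).
        return self.child == other.child


def dedup(
    exprs: List[ToJsonSketch],
    eq: Callable[[ToJsonSketch, ToJsonSketch], bool],
) -> List[ToJsonSketch]:
    # Keeps the first expression of each equivalence class under `eq`,
    # the way a hypothetical planner cache might.
    kept: List[ToJsonSketch] = []
    for expr in exprs:
        if not any(eq(expr, seen) for seen in kept):
            kept.append(expr)
    return kept


a = ToJsonSketch("col", "UTC", False)
b = ToJsonSketch("col", "America/Denver", False)

kept_full = dedup([a, b], ToJsonSketch.eq_full)
kept_partial = dedup([a, b], ToJsonSketch.eq_partial)
```

Under `eq_full` both expressions survive deduplication. Under `eq_partial` only the first survives, so a query that needed the second timezone would silently reuse the wrong expression. Whether DataFusion's planner actually performs such deduplication through this path is the open question for the reviewer.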
   
   ## Skipped (previous triage summaries)
   
   - Bug triage results: 2026-05-18 
([#4359](https://github.com/apache/datafusion-comet/issues/4359))
     - Previous triage summary issue rather than a bug; per this skill's 
contract it is the human reviewer's job to close it after spot-checking. Left 
as-is.
   - Bug triage results: 2026-05-11 
([#4287](https://github.com/apache/datafusion-comet/issues/4287))
     - Previous triage summary issue rather than a bug; same as above.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

