andygrove opened a new issue, #4548: URL: https://github.com/apache/datafusion-comet/issues/4548
Triage pass for issues labeled `requires-triage`.

- **Date:** 2026-06-01
- **Issues processed:** 48 (42 triaged, 6 skipped, 0 failed)
- **Priority counts applied:** `priority:critical` 11, `priority:high` 5, `priority:medium` 19, `priority:low` 7
- **Guide:** [docs/source/contributor-guide/bug_triage.md](https://github.com/apache/datafusion-comet/blob/main/docs/source/contributor-guide/bug_triage.md)

Labels have already been applied and `requires-triage` removed from the triaged issues. Please spot-check the calls below and close this issue when satisfied. Correct any label directly on the affected issue.

## Triaged

### priority:critical

- JVM codegen dispatcher miscompiles map-typed (MapType) output ([#4539](https://github.com/apache/datafusion-comet/issues/4539))
  - Area labels: `area:expressions`, `area:ffi`
  - Rationale: silent wrong result (map key corrupted, runs natively with no fallback); decision-tree step 1.
- [Bug] replace returns wrong result for empty-string search ([#4497](https://github.com/apache/datafusion-comet/issues/4497))
  - Area labels: `area:expressions`
  - Rationale: silent wrong result vs Spark for empty search string.
- [Bug] CAST(complex AS STRING) does not honour spark.sql.legacy.castComplexTypesToString.enabled ([#4492](https://github.com/apache/datafusion-comet/issues/4492))
  - Area labels: `area:expressions`
  - Rationale: ignores a config and produces wrong cast output (guide lists config-ignoring as critical).
- [Bug] array_max and array_min disagree with Spark on NaN ordering ([#4482](https://github.com/apache/datafusion-comet/issues/4482))
  - Area labels: `area:expressions`
  - Rationale: silent wrong result for NaN-containing arrays.
- [Bug] array_distinct / array_union / array_except do not canonicalize NaN like Spark ([#4481](https://github.com/apache/datafusion-comet/issues/4481))
  - Area labels: `area:expressions`
  - Rationale: silent wrong result for NaN / signed-zero elements.
- [Bug] str_to_map does not honour Spark 4.1.1 legacy.truncateForEmptyRegexSplit ([#4477](https://github.com/apache/datafusion-comet/issues/4477))
  - Area labels: `area:expressions`
  - Rationale: ignores a Spark 4.1.1 config, silently diverging when it is set.
- [Bug] decode ignores Spark 4.0 legacyCharsets and legacyErrorAction flags ([#4465](https://github.com/apache/datafusion-comet/issues/4465))
  - Area labels: `area:expressions`
  - Rationale: returns NULL where Spark substitutes or raises, a silent divergence in default and legacy modes.
- [Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion ([#4463](https://github.com/apache/datafusion-comet/issues/4463))
  - Area labels: `area:expressions`
  - Rationale: silent wrong result for combining marks and NUL-deletion semantics.
- [Bug] make_date does not throw under spark.sql.ansi.enabled=true ([#4451](https://github.com/apache/datafusion-comet/issues/4451))
  - Area labels: `area:expressions`
  - Rationale: returns NULL instead of the Spark ANSI error, a silent divergence when ANSI is on.
- [Bug] next_day trims whitespace from dayOfWeek; Spark does not ([#4450](https://github.com/apache/datafusion-comet/issues/4450))
  - Area labels: `area:expressions`
  - Rationale: returns a date where Spark returns NULL, an unconditional silent wrong result.
- [Bug] next_day does not throw under spark.sql.ansi.enabled=true ([#4449](https://github.com/apache/datafusion-comet/issues/4449))
  - Area labels: `area:expressions`
  - Rationale: returns NULL instead of the Spark ANSI error, a silent divergence when ANSI is on.

### priority:high

- CreateArray with nullability-divergent children panics in native make_array ([#4528](https://github.com/apache/datafusion-comet/issues/4528))
  - Area labels: `area:expressions`
  - Rationale: native panic (assertion failure in make_array); decision-tree step 2.
- ConstantColumnVector inputs fail Comet export with "Comet execution only takes Arrow Arrays" ([#4527](https://github.com/apache/datafusion-comet/issues/4527))
  - Area labels: `area:ffi`
  - Rationale: unhandled exception on a supported path (partition / constant columns, e.g. OPTIMIZE).
- native shuffle: get_string should not panic on non-UTF-8 bytes (use lossy decode) ([#4521](https://github.com/apache/datafusion-comet/issues/4521))
  - Area labels: `area:shuffle`
  - Rationale: native panic in shuffle on non-UTF-8 string bytes.
- CometScanRule: decline native V1 scans on object_store-unsupported filesystem schemes ([#4520](https://github.com/apache/datafusion-comet/issues/4520))
  - Area labels: `area:scan`
  - Rationale: native scan crashes at execution on custom filesystem schemes instead of falling back.
- [Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour) ([#4488](https://github.com/apache/datafusion-comet/issues/4488))
  - Area labels: `area:expressions`
  - Rationale: Rust undefined behaviour / memory-safety risk on the cast path (see escalation note).

### priority:medium

- Native scan file-read failures should surface as Spark's FAILED_READ_FILE.NO_HINT ([#4529](https://github.com/apache/datafusion-comet/issues/4529))
  - Area labels: `area:scan`
  - Rationale: error-compatibility gap (raw native message and missing path) with a fallback workaround.
- Deep AND/OR predicate chains overflow protobuf recursion limit when the serialized plan is re-parsed ([#4526](https://github.com/apache/datafusion-comet/issues/4526))
  - Area labels: `area:expressions`
  - Rationale: query fails on deep chains, but the trigger (>100 operands) is uncommon and degrades to a clean error.
- Revert transition-heavy stages to Spark row-based execution ([#4518](https://github.com/apache/datafusion-comet/issues/4518))
  - Area labels: none
  - Rationale: performance optimization for stages that accumulate many C2R/R2C transitions.
- Native divide-by-zero in a dispatched ScalaUDF surfaces CometNativeException instead of SparkArithmeticException ([#4517](https://github.com/apache/datafusion-comet/issues/4517))
  - Area labels: `area:expressions`
  - Rationale: wrong exception class under ANSI (errors either way, only the surface differs).
- CometProject and CometHashAggregate do not perform cross-sibling subexpression elimination over ScalaUDF ([#4516](https://github.com/apache/datafusion-comet/issues/4516))
  - Area labels: `area:expressions`, `area:aggregation`
  - Rationale: result correct but UDF invoked N times instead of once, a performance gap for UDF-heavy queries.
- DataFusion / DataFusion-Spark functions whose Arrow return type drifts from Spark catalyst's declared type ([#4515](https://github.com/apache/datafusion-comet/issues/4515))
  - Area labels: `area:ffi`, `area:expressions`
  - Rationale: latent type-drift (masked by FFI re-stamping today) that errors when FFI hops are reduced.
- map expression audit follow-ups (from #4478) ([#4505](https://github.com/apache/datafusion-comet/issues/4505))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- collection expression audit follow-ups (from #4473) ([#4504](https://github.com/apache/datafusion-comet/issues/4504))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- array expression audit follow-ups (from #4483) ([#4503](https://github.com/apache/datafusion-comet/issues/4503))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- date/time expression audit follow-ups (from #4448) ([#4502](https://github.com/apache/datafusion-comet/issues/4502))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- cast expression audit follow-ups (from #4493) ([#4501](https://github.com/apache/datafusion-comet/issues/4501))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- Math expression audit follow-ups (from #4486) ([#4500](https://github.com/apache/datafusion-comet/issues/4500))
  - Area labels: `area:expressions`
  - Rationale: deferred audit follow-up tracker, mostly support-level / serde consistency work (see escalation note).
- [Feature] CAST(MapType AS MapType) falls back even though native cast_map_to_map exists ([#4491](https://github.com/apache/datafusion-comet/issues/4491))
  - Area labels: `area:expressions`
  - Rationale: missing cast support, falls back to Spark (correct but unaccelerated).
- [Bug] try_mod falls back to Spark because CometRemainder rejects EvalMode.TRY ([#4484](https://github.com/apache/datafusion-comet/issues/4484))
  - Area labels: `area:expressions`
  - Rationale: feature gap, falls back to Spark; result correct via fallback.
- [Feature] support size() for MapType inputs ([#4472](https://github.com/apache/datafusion-comet/issues/4472))
  - Area labels: `area:expressions`
  - Rationale: missing expression support with a Spark fallback.
- [Feature] support concat() for BinaryType and ArrayType inputs ([#4471](https://github.com/apache/datafusion-comet/issues/4471))
  - Area labels: `area:expressions`
  - Rationale: missing expression support with a Spark fallback.
- [Bug] CometCaseConversionBase gates compat inside convert() instead of getSupportLevel ([#4467](https://github.com/apache/datafusion-comet/issues/4467))
  - Area labels: `area:expressions`
  - Rationale: the allowIncompatible config is bypassed for upper/lower, a functional config bug.
- [Bug] bit_length and octet_length error natively for BinaryType input instead of falling back ([#4464](https://github.com/apache/datafusion-comet/issues/4464))
  - Area labels: `area:expressions`
  - Rationale: native execution error on binary input instead of a clean fallback; uncommon input, workaround exists.
- Bound CometS3CredentialDispatcher cache via refcounted handle lifecycle ([#4456](https://github.com/apache/datafusion-comet/issues/4456))
  - Area labels: `area:scan`
  - Rationale: unbounded cache growth on long-running JVMs (eventual OOM), a conditional degradation.

### priority:low

- CI lint check passed, but then later jobs failed with lint errors ([#4545](https://github.com/apache/datafusion-comet/issues/4545))
  - Area labels: `area:ci`
  - Rationale: CI/tooling lint inconsistency (see escalation note).
- PlanDataInjector does N x M canInject calls per operator tree ([#4530](https://github.com/apache/datafusion-comet/issues/4530))
  - Area labels: none
  - Rationale: minor micro-optimization, explicitly no behavior change.
- Do another audit sweep for string collation differences ([#4496](https://github.com/apache/datafusion-comet/issues/4496))
  - Area labels: `area:expressions`
  - Rationale: process / tooling task (audit sweep), no concrete defect identified.
- [Doc] CAST has no explicit TimeType branch (Spark 4.1) ([#4490](https://github.com/apache/datafusion-comet/issues/4490))
  - Area labels: `area:expressions`
  - Rationale: documentation / support-level gap; the fallback itself is correct.
- [Doc] CAST collated-string handling on Spark 4.0+ is implicit and untested ([#4489](https://github.com/apache/datafusion-comet/issues/4489))
  - Area labels: `area:expressions`
  - Rationale: documentation / test gap; current fallback behavior is correct.
- [Bug] width_bucket bypasses CometExpressionSerde framework ([#4485](https://github.com/apache/datafusion-comet/issues/4485))
  - Area labels: `area:expressions`
  - Rationale: serde-framework consistency refactor; no wrong result or crash.
- [Doc] decode does not appear in auto-generated compatibility docs ([#4466](https://github.com/apache/datafusion-comet/issues/4466))
  - Area labels: `area:expressions`
  - Rationale: documentation gap (decode wired via shim, not a serde).

## Escalations to consider

- [Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour) ([#4488](https://github.com/apache/datafusion-comet/issues/4488))
  - Labeled `priority:high` for memory safety. Per the guide's "high crash that also produces wrong results silently" trigger, undefined behaviour that could silently corrupt output may warrant `priority:critical`.
- CI lint check passed, but then later jobs failed with lint errors ([#4545](https://github.com/apache/datafusion-comet/issues/4545))
  - Labeled `priority:low`. Per the guide, a CI issue that consistently blocks PR merges should escalate to `priority:medium`.
- Audit follow-up trackers ([#4505](https://github.com/apache/datafusion-comet/issues/4505), [#4504](https://github.com/apache/datafusion-comet/issues/4504), [#4503](https://github.com/apache/datafusion-comet/issues/4503), [#4502](https://github.com/apache/datafusion-comet/issues/4502), [#4501](https://github.com/apache/datafusion-comet/issues/4501), [#4500](https://github.com/apache/datafusion-comet/issues/4500))
  - Each bundles many sub-items of mixed severity, including Spark 4.0+ non-default-collation correctness gaps that silently diverge.
    Labeled `priority:medium` as trackers; the reviewer may want to split the collation sub-items into standalone `priority:critical` issues.

## Skipped: needs more info

- [EPIC] Support Spark interval types (CalendarInterval / YearMonthInterval / DayTimeInterval) and interval expressions ([#4540](https://github.com/apache/datafusion-comet/issues/4540))
  - Open-ended EPIC umbrella; a single priority is a roadmap decision rather than a mechanical triage call.
- [EPIC] Provide JVM/codegen-dispatch implementations for Incompatible expressions so they never fall back by default ([#4506](https://github.com/apache/datafusion-comet/issues/4506))
  - Open-ended EPIC umbrella; a single priority is a roadmap decision rather than a mechanical triage call.
- Discussion: Should Comet add geospatial (ST_*) function support? ([#4455](https://github.com/apache/datafusion-comet/issues/4455))
  - Discussion / scope question needing community and maintainer input, not a triageable defect.
- Bug triage results: 2026-05-26 ([#4441](https://github.com/apache/datafusion-comet/issues/4441))
  - Prior triage summary issue (auto-labeled `requires-triage`); meta, awaiting human review and closure, not a bug.
- Bug triage results: 2026-05-18 ([#4359](https://github.com/apache/datafusion-comet/issues/4359))
  - Prior triage summary issue (auto-labeled `requires-triage`); meta, awaiting human review and closure, not a bug.
- Bug triage results: 2026-05-11 ([#4287](https://github.com/apache/datafusion-comet/issues/4287))
  - Prior triage summary issue (auto-labeled `requires-triage`); meta, awaiting human review and closure, not a bug.
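
For reviewers spot-checking the calls above, the ordering that the rationales appear to follow can be sketched in a few lines of Python. This is only an illustration inferred from this report, not the triage guide's actual decision tree: the `Issue` class, its field names, and `suggest_priority` are hypothetical, and only steps 1 and 2 are named in the rationales. The medium and low branches are assumptions based on how fallback, performance, and documentation issues were labeled here.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical summary of the facts a triager reads off an issue."""
    silent_wrong_result: bool = False   # diverges from Spark with no error or fallback
    crash_or_panic: bool = False        # native panic, unhandled exception, or UB
    docs_or_tooling_only: bool = False  # documentation, CI, refactor, or audit task


def suggest_priority(issue: Issue) -> str:
    """Suggest a priority label; the ordering is inferred from the report's rationales."""
    # Step 1 (named in the report): silent wrong results are critical.
    # Checking this first also covers the escalation case of a crash
    # that additionally produces wrong results silently.
    if issue.silent_wrong_result:
        return "priority:critical"
    # Step 2 (named in the report): native panics and crashes are high.
    if issue.crash_or_panic:
        return "priority:high"
    # Assumption: documentation and tooling items with no defect are low.
    if issue.docs_or_tooling_only:
        return "priority:low"
    # Assumption: everything else (correct fallbacks, performance gaps,
    # error-compatibility gaps) lands at medium.
    return "priority:medium"


if __name__ == "__main__":
    print(suggest_priority(Issue(silent_wrong_result=True)))
    print(suggest_priority(Issue(crash_or_panic=True)))
    print(suggest_priority(Issue()))
```

The sketch deliberately ignores judgment calls such as the escalation candidates listed earlier, which is why those are flagged for human review rather than decided mechanically.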
