comphead commented on code in PR #4550: URL: https://github.com/apache/datafusion-comet/pull/4550#discussion_r3337285635
##########
docs/source/user-guide/latest/expressions.md:
##########
@@ -17,354 +17,652 @@
under the License.
-->
-# Supported Spark Expressions
-
-Comet supports the following Spark expressions. See the [Comet Compatibility Guide] for details on known
-incompatibilities and unsupported cases.
-
-All expressions are enabled by default, but most can be disabled by setting
-`spark.comet.expression.EXPRNAME.enabled=false`, where `EXPRNAME` is the expression name as specified in
-the following tables, such as `Length`, or `StartsWith`. See the [Comet Configuration Guide] for a full list
-of expressions that be disabled.
-
-## Conditional Expressions
-
-| Expression | SQL |
-| ---------- | ------------------------------------------- |
-| CaseWhen | `CASE WHEN expr THEN expr ELSE expr END` |
-| If | `IF(predicate_expr, true_expr, false_expr)` |
-
-## Predicate Expressions
-
-| Expression | SQL |
-| ------------------ | ------------- |
-| And | `AND` |
-| EqualTo | `=` |
-| EqualNullSafe | `<=>` |
-| GreaterThan | `>` |
-| GreaterThanOrEqual | `>=` |
-| ILike | `ILIKE` |
-| In | `IN` |
-| InSet | `IN (...)` |
-| IsNotNull | `IS NOT NULL` |
-| IsNull | `IS NULL` |
-| LessThan | `<` |
-| LessThanOrEqual | `<=` |
-| Not | `NOT` |
-| Or | `OR` |
-
-## String Functions
-
-| Expression |
-| --------------- |
-| Ascii |
-| BitLength |
-| Chr |
-| Concat |
-| ConcatWs |
-| Contains |
-| Decode |
-| EndsWith |
-| InitCap |
-| Left |
-| Length |
-| Like |
-| Lower |
-| OctetLength |
-| Reverse |
-| Right |
-| RLike |
-| Split |
-| StartsWith |
-| StringInstr |
-| StringRepeat |
-| StringReplace |
-| StringLPad |
-| StringRPad |
-| StringSpace |
-| StringTranslate |
-| StringTrim |
-| StringTrimBoth |
-| StringTrimLeft |
-| StringTrimRight |
-| Substring |
-| SubstringIndex |
-| Upper |
-
-## JSON Functions
-
-| Expression |
-| ------------- |
-| GetJsonObject |
-
-## Date/Time Functions
-
-| Expression | SQL |
-| ----------------- | ---------------------------- |
-| AddMonths | `add_months` |
-| ConvertTimezone | `convert_timezone` |
-| CurrentTimeZone | `current_timezone` |
-| DateAdd | `date_add` |
-| DateDiff | `datediff` |
-| DateFormat | `date_format` |
-| DateFromUnixDate | `date_from_unix_date` |
-| DateSub | `date_sub` |
-| DatePart | `date_part(field, source)` |
-| Days | `days` |
-| Extract | `extract(field FROM source)` |
-| FromUnixTime | `from_unixtime` |
-| Hour | `hour` |
-| LastDay | `last_day` |
-| LocalTimestamp | `localtimestamp` |
-| MakeDate | `make_date` |
-| MakeTime | `make_time` |
-| MakeTimestamp | `make_timestamp` |
-| MicrosToTimestamp | `timestamp_micros` |
-| MillisToTimestamp | `timestamp_millis` |
-| Minute | `minute` |
-| MonthsBetween | `months_between` |
-| NextDay | `next_day` |
-| Second | `second` |
-| TimestampSeconds | `timestamp_seconds` |
-| ToUnixTimestamp | `to_unix_timestamp` |
-| TruncDate | `trunc` |
-| TruncTimestamp | `date_trunc` |
-| UnixDate | `unix_date` |
-| UnixMicros | `unix_micros` |
-| UnixMillis | `unix_millis` |
-| UnixSeconds | `unix_seconds` |
-| UnixTimestamp | `unix_timestamp` |
-| Year | `year` |
-| Month | `month` |
-| DayOfMonth | `day`/`dayofmonth` |
-| DayOfWeek | `dayofweek` |
-| WeekDay | `weekday` |
-| DayOfYear | `dayofyear` |
-| WeekOfYear | `weekofyear` |
-| Quarter | `quarter` |
-| ToTime | `to_time` |
-| TryToTime | `try_to_time` |
-
-## Math Expressions
-
-| Expression | SQL |
-| -------------- | -------------- |
-| Abs | `abs` |
-| Acos | `acos` |
-| Acosh | `acosh` |
-| Add | `+` |
-| Asin | `asin` |
-| Asinh | `asinh` |
-| Atan | `atan` |
-| Atan2 | `atan2` |
-| Atanh | `atanh` |
-| Bin | `bin` |
-| BRound | `bround` |
-| Cbrt | `cbrt` |
-| Ceil | `ceil` |
-| Cos | `cos` |
-| Cosh | `cosh` |
-| Cot | `cot` |
-| Csc | `csc` |
-| Divide | `/` |
-| Exp | `exp` |
-| Expm1 | `expm1` |
-| Factorial | `factorial` |
-| Floor | `floor` |
-| Hex | `hex` |
-| IntegralDivide | `div` |
-| IsNaN | `isnan` |
-| Log | `log` |
-| Log2 | `log2` |
-| Log10 | `log10` |
-| Multiply | `*` |
-| Pi | `pi` |
-| Pow | `power` |
-| Rand | `rand` |
-| Randn | `randn` |
-| Remainder | `%` |
-| Rint | `rint` |
-| Round | `round` |
-| Sec | `sec` |
-| Signum | `signum` |
-| Sin | `sin` |
-| Sinh | `sinh` |
-| Sqrt | `sqrt` |
-| Subtract | `-` |
-| Tan | `tan` |
-| Tanh | `tanh` |
-| ToDegrees | `degrees` |
-| ToRadians | `radians` |
-| TryAdd | `try_add` |
-| TryDivide | `try_div` |
-| TryMultiply | `try_mul` |
-| TrySubtract | `try_sub` |
-| UnaryMinus | `-` |
-| Unhex | `unhex` |
-| WidthBucket | `width_bucket` |
-
-## Hashing Functions
-
-| Expression |
-| ----------- |
-| Crc32 |
-| Md5 |
-| Murmur3Hash |
-| Sha1 |
-| Sha2 |
-| XxHash64 |
-
-## Bitwise Expressions
-
-| Expression | SQL |
-| ------------------ | ----- |
-| BitwiseAnd | `&` |
-| BitwiseCount | |
-| BitwiseGet | |
-| BitwiseOr | `\|` |
-| BitwiseNot | `~` |
-| BitwiseXor | `^` |
-| ShiftLeft | `<<` |
-| ShiftRight | `>>` |
-| ShiftRightUnsigned | `>>>` |
-
-## Aggregate Expressions
-
-| Expression | SQL |
-| ------------- | ---------- |
-| Average | |
-| BitAndAgg | |
-| BitOrAgg | |
-| BitXorAgg | |
-| BoolAnd | `bool_and` |
-| BoolOr | `bool_or` |
-| CollectSet | |
-| Corr | |
-| Count | |
-| CountIf | `count_if` |
-| CovPopulation | |
-| CovSample | |
-| First | |
-| Last | |
-| Max | |
-| Min | |
-| StddevPop | |
-| StddevSamp | |
-| Sum | |
-| VariancePop | |
-| VarianceSamp | |
-
-## Window Functions
-
-```{warning}
-Window support is disabled by default due to known correctness issues. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721).
-```
-
-Comet supports using the following aggregate functions within window contexts with PARTITION BY and ORDER BY clauses.
-
-| Expression |
-| ---------- |
-| Count |
-| Max |
-| Min |
-| Sum |
-
-**Note:** Dedicated window functions such as `rank`, `dense_rank`, `row_number`, `lag`, `lead`, `ntile`, `cume_dist`, `percent_rank`, and `nth_value` are not currently supported and will fall back to Spark.
-
-## Array Expressions
-
-| Expression |
-| -------------- |
-| ArrayAppend |
-| ArrayCompact |
-| ArrayContains |
-| ArrayDistinct |
-| ArrayExcept |
-| ArrayFilter |
-| ArrayInsert |
-| ArrayIntersect |
-| ArrayJoin |
-| ArrayMax |
-| ArrayMin |
-| ArrayPosition |
-| ArrayRemove |
-| ArrayRepeat |
-| ArraysZip |
-| ArrayUnion |
-| ArraysOverlap |
-| CreateArray |
-| ElementAt |
-| Flatten |
-| GetArrayItem |
-| Size |
-| SortArray |
-
-## Map Expressions
-
-| Expression |
-| -------------- |
-| GetMapValue |
-| MapContainsKey |
-| MapEntries |
-| MapFromArrays |
-| MapFromEntries |
-| MapKeys |
-| MapValues |
-| StringToMap |
-
-## Struct Expressions
-
-| Expression |
-| -------------------- |
-| CreateNamedStruct |
-| GetArrayStructFields |
-| GetStructField |
-| JsonToStructs |
-| StructsToJson |
-
-## URL Functions
-
-| Expression |
-| ------------ |
-| TryUrlDecode |
-| UrlDecode |
-| UrlEncode |
-
-## Conversion Expressions
-
-| Expression |
-| ---------- |
-| Cast |
-
-## SortOrder
-
-| Expression |
-| ---------- |
-| NullsFirst |
-| NullsLast |
-| Ascending |
-| Descending |
-
-## Other
-
-| Expression |
-| ---------------------------- |
-| Alias |
-| AttributeReference |
-| BloomFilterMightContain |
-| Coalesce |
-| CheckOverflow |
-| KnownFloatingPointNormalized |
-| Literal |
-| MakeDecimal |
-| MonotonicallyIncreasingID |
-| NormalizeNaNAndZero |
-| PromotePrecision |
-| RegExpReplace |
-| ScalarSubquery |
-| SparkPartitionID |
-| ToPrettyString |
-| UnscaledValue |
-
-[Comet Configuration Guide]: configs.md
-[Comet Compatibility Guide]: compatibility/expressions/index.md
+# Spark Expression Support
+
+This page is the complete reference for how Apache Comet handles each Spark built-in
+expression. Comet accelerates expressions either with a native (Rust) implementation or by
+dispatching to a Spark-compatible codegen path. When an expression is not supported, Comet
+transparently falls back to Spark for that part of the plan; results are unaffected.
+
+Expressions marked ✅ Supported are enabled by default. Expressions marked ⚠️ Supported
+(caveats) include cases that are known to diverge from Spark; those cases fall back to Spark
+by default and must be opted into per expression with
+`spark.comet.expression.EXPRNAME.allowIncompatible=true` (where `EXPRNAME` is the Spark
+expression class name, for example `Cast`). There is no global opt-in.
+
+Most expressions can also be disabled with `spark.comet.expression.EXPRNAME.enabled=false`, where
+`EXPRNAME` is the Spark expression class name (for example `Length` or `StartsWith`). See the
+[Comet Configuration Guide](configs.md) for the full list.
+
+## Status legend
+
+| Status                 | Meaning |
+| ---------------------- | ------- |
+| ✅ Supported           | Native or codegen path; compatible with Spark by default. |
+| ⚠️ Supported (caveats) | Works, but may diverge from Spark in some cases: incompatible, flag-gated (`allowIncompatible`), or restricted to certain types. See the [Compatibility Guide](compatibility/index.md). |
+| 🔜 Planned             | Intended; tracked by an open issue or pull request. |
+| 🚫 Out of scope        | Deliberately not planned. |
+
+## Out of scope
+
+Comet focuses acceleration on mainstream relational, string, datetime, math, and collection
+expressions. Some Spark function families are **out of scope**: specialized functionality with
+narrow real-world analytics use and high implementation cost. These will fall back to Spark and
+are not on the roadmap:
+
+- **Probabilistic sketches and approximate top-k** (`kll_sketch_*`, `hll_*`, `theta_*`, `count_min_sketch`, `bitmap_*`, `approx_top_k*`): specialized data structures with exact-correctness traps.
+- **XML / XPath** (`from_xml`, `to_xml`, `schema_of_xml`, `xpath*`): legacy text format, rare in accelerated workloads.
+- **Geospatial** (`st_*`): brand-new Spark 4.1 functionality, specialized.
+- **Avro / Protobuf codecs** (`from_avro`, `to_avro`, `from_protobuf`, `to_protobuf`, `schema_of_avro`): format conversion belongs at the IO layer, not expression evaluation.
+- **JVM reflection** (`java_method`, `reflect`): niche, and they invoke arbitrary JVM methods (a security concern).
+- **CSV functions** (`from_csv`, `to_csv`, `schema_of_csv`): row-level CSV parsing and formatting in expressions is niche and better handled at the data source layer.
+- **UTF-8 validation** (`is_valid_utf8`, `make_valid_utf8`, `validate_utf8`, `try_validate_utf8`): niche Spark 4.x string-validation helpers.
+- **File metadata** (`input_file_name`, `input_file_block_start`, `input_file_block_length`): require scan-internal per-row file information, outside the expression layer.
+- **Miscellaneous niche** (`histogram_numeric`, `version`, `sentences`, `quote`, `uuid`): low-value or specialized functions with little benefit from native acceleration.
+
+Note that `approx_count_distinct`, `approx_percentile` / `percentile_approx`, `median`, and `mode`
+are _not_ out of scope: although approximate, they are mainstream and are planned.
+
+The tables below list every Spark built-in expression with its current status.
+
+## agg_funcs
+
+| Function | Status | Notes
|
+| ----------------------- | ------ |
---------------------------------------------------------------- |
+| `any` | ✅ |
|
+| `any_value` | ✅ |
|
+| `approx_count_distinct` | 🔜 | tracking #4098
|
+| `approx_percentile` | 🔜 |
[#3189](https://github.com/apache/datafusion-comet/issues/3189) |
+| `array_agg` | 🔜 | Array aggregate (related to
`collect_list`, #2524) |
+| `avg` | ⚠️ | Interval types (YearMonth, DayTime) fall
back |
+| `bit_and` | ✅ |
|
+| `bit_or` | ✅ |
|
+| `bit_xor` | ✅ |
|
+| `bool_and` | ✅ |
|
+| `bool_or` | ✅ |
|
+| `collect_list` | 🔜 |
[#2524](https://github.com/apache/datafusion-comet/issues/2524) |
+| `collect_set` | ✅ |
|
+| `corr` | ✅ |
|
+| `count` | ✅ |
|
+| `count_if` | ✅ |
|
+| `covar_pop` | ✅ |
|
+| `covar_samp` | ✅ |
|
+| `every` | ✅ |
|
+| `first` | ✅ |
|
+| `first_value` | ✅ |
|
+| `grouping` | 🔜 | Grouping indicator for
ROLLUP/CUBE/GROUPING SETS |
+| `grouping_id` | 🔜 | Grouping indicator for
ROLLUP/CUBE/GROUPING SETS |
+| `kurtosis` | 🔜 | tracking #4098
|
+| `last` | ✅ |
|
+| `last_value` | ✅ |
|
+| `listagg` | 🔜 | String aggregation
|
+| `max` | ✅ |
|
+| `max_by` | 🔜 |
[#3841](https://github.com/apache/datafusion-comet/issues/3841) |
+| `mean` | ✅ |
|
+| `median` | 🔜 | tracking #4098
|
+| `min` | ✅ |
|
+| `min_by` | 🔜 |
[#3841](https://github.com/apache/datafusion-comet/issues/3841) |
+| `mode` | 🔜 |
[#3970](https://github.com/apache/datafusion-comet/issues/3970) |
+| `percentile` | 🔜 | #4542
|
+| `percentile_approx` | 🔜 |
[#3189](https://github.com/apache/datafusion-comet/issues/3189) |
+| `percentile_cont` | 🔜 | Percentile aggregate
|
+| `percentile_disc` | 🔜 | Percentile aggregate
|
+| `regr_avgx` | ✅ | Native: Spark rewrites to `Average` (tests
in #4551) |
+| `regr_avgy` | ✅ | Native: Spark rewrites to `Average` (tests
in #4551) |
+| `regr_count` | ✅ | Native: Spark rewrites to `Count` (tests
in #4551) |
+| `regr_intercept` | 🔜 | Falls back; can reuse
`covar_pop`/`var_pop` accumulators (#4552) |
+| `regr_r2` | 🔜 | Falls back; can reuse the `corr`
accumulator (#4552) |
+| `regr_slope` | 🔜 | Falls back; can reuse
`covar_pop`/`var_pop` accumulators (#4552) |
+| `regr_sxx` | 🔜 | Falls back; can reuse `var_pop`
accumulator (#4552) |
+| `regr_sxy` | 🔜 | Falls back; can reuse `covar_pop`
accumulator (#4552) |
+| `regr_syy` | 🔜 | Falls back; can reuse `var_pop`
accumulator (#4552) |
+| `skewness` | 🔜 | tracking #4098
|
+| `some` | ✅ |
|
+| `std` | ✅ |
|
+| `stddev` | ✅ |
|
+| `stddev_pop` | ✅ |
|
+| `stddev_samp` | ✅ |
|
+| `string_agg` | 🔜 | String aggregation (alias of `listagg`)
|
+| `sum` | ✅ |
|
+| `try_avg` | 🔜 | tracking #4098
|
+| `try_sum` | 🔜 | tracking #4098
|
+| `var_pop` | ✅ |
|
+| `var_samp` | ✅ |
|
+| `variance` | ✅ |
|
+
+---
+
+## array_funcs
+
+| Function | Status | Notes
|
+| ----------------- | ------ |
-------------------------------------------------------------------------------------------------------------------------
|
+| `array` | ✅ |
|
+| `array_append` | ⚠️ | On Spark 4.0+ rewrites to `array_insert`;
inherits its incompatibilities
|
+| `array_compact` | ✅ |
|
+| `array_contains` | ⚠️ | NaN-canonicalization may differ for
float/double arrays
([#4481](https://github.com/apache/datafusion-comet/issues/4481)) |
+| `array_distinct` | ⚠️ | NaN/signed-zero canonicalization may differ
([#4481](https://github.com/apache/datafusion-comet/issues/4481)) |
+| `array_except` | ⚠️ | Null handling and ordering may differ;
`Incompatible`, flag-gated
|
+| `array_insert` | ✅ |
|
+| `array_intersect` | ⚠️ | Result element order may differ when right
array is longer than left |
+| `array_join` | ⚠️ | Null handling may differ
([#3178](https://github.com/apache/datafusion-comet/issues/3178));
`Incompatible`, flag-gated |
+| `array_max` | ⚠️ | NaN ordering may differ for float/double
([#4482](https://github.com/apache/datafusion-comet/issues/4482))
|
+| `array_min` | ⚠️ | NaN ordering may differ for float/double
([#4482](https://github.com/apache/datafusion-comet/issues/4482))
|
+| `array_position` | ⚠️ | Falls back for binary/struct/map/null element
types |
+| `array_prepend` | 🔜 | Sibling of `array_append`
|
+| `array_remove` | ✅ |
|
+| `array_repeat` | ✅ |
|
+| `array_union` | ⚠️ | NaN/signed-zero canonicalization may differ
([#4481](https://github.com/apache/datafusion-comet/issues/4481)) |
+| `arrays_overlap` | ✅ |
|
+| `arrays_zip` | ✅ |
|
+| `element_at` | ⚠️ | Only `ArrayType` input; `MapType` input falls
back |
+| `flatten` | ⚠️ | Falls back for binary/struct/map child element
types |
+| `get` | ✅ |
|
+| `sequence` | 🔜 | #4538
|
+| `shuffle` | 🔜 | Random array shuffle
|
+| `slice` | ✅ | Native (#4149)
|
+| `sort_array` | ⚠️ | Incompatible under strict floating-point; falls
back for nested struct/null arrays |
+
+---
+
+## bitwise_funcs
+
+| Function | Status | Notes
|
+| -------------------- | ------ |
---------------------------------------------------- |
+| `&` | ✅ |
|
+| `<<` | ✅ |
|
+| `>>` | ✅ |
|
+| `>>>` | ✅ | Operator alias for `shiftrightunsigned`
(Spark 4.0+) |
+| `^` | ✅ |
|
+| `bit_count` | ✅ |
|
+| `bit_get` | ✅ |
|
+| `getbit` | ✅ |
|
+| `shiftright` | ✅ |
|
+| `shiftrightunsigned` | ✅ |
|
+| `\|` | ✅ |
|
+| `~` | ✅ |
|
+
+---
+
+## collection_funcs
+
+| Function      | Status | Notes |
+| ------------- | ------ | ----- |
+| `array_size`  | ⚠️     | Lowers to `size`; accelerated, but returns -1 instead of NULL for NULL input (#4560) |
+| `cardinality` | ⚠️     | Alias for `size`; `MapType` input falls back ([#4472](https://github.com/apache/datafusion-comet/issues/4472)) |
+| `concat`      | ⚠️     | Only `StringType` children; `BinaryType`/`ArrayType` fall back ([#4471](https://github.com/apache/datafusion-comet/issues/4471)) |
+| `reverse`     | ⚠️     | Array with `BinaryType` elements is `Incompatible`, flag-gated ([#2763](https://github.com/apache/datafusion-comet/issues/2763)) |
+| `size`        | ⚠️     | `MapType` input falls back ([#4472](https://github.com/apache/datafusion-comet/issues/4472)) |
+
+---
+
+## conditional_funcs
+
+| Function | Status | Notes |
+| ------------ | ------ | --------------------------------- |
+| `coalesce` | ✅ | |
+| `if` | ✅ | |
+| `ifnull` | ✅ | |
+| `nanvl` | 🔜 | #4538 |
+| `nullif` | ✅ | |
+| `nullifzero` | ✅ | Lowers to `if`/`=` (Spark 4.0+) |
+| `nvl` | ✅ | |
+| `nvl2` | ✅ | |
+| `when` | ✅ | |
+| `zeroifnull` | ✅ | Lowers to `coalesce` (Spark 4.0+) |
+
+---
+
+## conversion_funcs
+
+The type-name conversion functions (`bigint`, `binary`, `boolean`, `date`, `decimal`, `double`, `float`, `int`, `smallint`, `string`, `timestamp`, `tinyint`) are SQL aliases for `CAST(... AS <type>)` and share the support and caveats of `cast`.
+
+| Function | Status | Notes |
+| -------- | ------ | ----- |
+| `cast`   | ⚠️     | Many type pairs supported; float-to-decimal rounding may differ; see [Compatibility Guide](compatibility/index.md) |
+
+---
+
+## datetime_funcs
+
+| Function | Status | Notes
|
+| --------------------- | ------ |
------------------------------------------------------------------------------------------------------
|
+| `add_months` | ✅ |
|
+| `convert_timezone` | ✅ |
|
+| `curdate` | ✅ | Constant-folded to a literal (alias of
`current_date`) |
+| `current_date` | ✅ | Constant-folded to a literal before Comet
sees the plan |
+| `current_time` | 🔜 | Blocked on Spark 4.1 TIME type support
(#4288) |
+| `current_timestamp` | ✅ | Constant-folded to a literal before Comet
sees the plan |
+| `current_timezone` | ✅ |
|
+| `date_add` | ✅ |
|
+| `date_diff` | ✅ |
|
+| `date_format` | ✅ |
|
+| `date_from_unix_date` | ✅ |
|
+| `date_part` | ✅ |
|
+| `date_sub` | ✅ |
|
+| `date_trunc` | ✅ |
|
+| `dateadd` | ✅ |
|
+| `datediff` | ✅ |
|
+| `datepart` | ✅ |
|
+| `day` | ✅ |
|
+| `dayname` | 🔜 | #4544
|
+| `dayofmonth` | ✅ |
|
+| `dayofweek` | ✅ |
|
+| `dayofyear` | ✅ |
|
+| `extract` | ✅ |
|
+| `from_unixtime` | ✅ |
|
+| `from_utc_timestamp` | ⚠️ | Legacy zone forms (`GMT+1`, `PST`) throw a
native parse error |
+| `hour` | ✅ |
|
+| `last_day` | ✅ |
|
+| `localtimestamp` | ✅ |
|
+| `make_date` | ✅ |
|
+| `make_dt_interval` | 🔜 | #4541
|
+| `make_interval` | 🔜 | Produces legacy CalendarInterval; tracked by
#4540 |
+| `make_time` | 🔜 | Spark 4.1 TIME type; tracked by #4288
|
+| `make_timestamp` | ✅ |
|
+| `make_timestamp_ltz` | ⚠️ | 6-arg form runs via the codegen dispatcher;
2-arg `(date, time)` form (Spark 4.1 TIME type) falls back |
+| `make_timestamp_ntz` | ⚠️ | 6-arg form runs via the codegen dispatcher;
2-arg `(date, time)` form (Spark 4.1 TIME type) falls back |
+| `make_ym_interval` | 🔜 | #4541
|
+| `minute` | ✅ |
|
+| `month` | ✅ |
|
+| `monthname` | 🔜 | #4544
|
+| `months_between` | ✅ |
|
+| `next_day` | ✅ |
|
+| `now` | ✅ | Constant-folded to a literal (alias of
`current_timestamp`) |
+| `quarter` | ✅ |
|
+| `second` | ✅ |
|
+| `session_window` | 🔜 | Time-window grouping; tracked by #4553
|
+| `time_diff` | 🔜 | Spark 4.1 TIME type; tracked by #4288
|
+| `time_trunc` | 🔜 | Spark 4.1 TIME type; tracked by #4288
|
+| `timestamp_micros` | ✅ |
|
+| `timestamp_millis` | ✅ |
|
+| `timestamp_seconds` | ✅ |
|
+| `to_date` | ✅ | Rewrites to `Cast` (or `Cast(GetTimestamp)`
with a format) before Comet sees the plan |
+| `to_time` | 🔜 | Spark 4.1 TIME type; tracked by #4288
|
+| `to_timestamp` | ✅ | Rewrites to `Cast` (or `GetTimestamp` with a
format) before Comet sees the plan |
+| `to_timestamp_ltz` | ✅ | Rewrites to `to_timestamp` (`TimestampType`)
|
+| `to_timestamp_ntz` | ✅ | Rewrites to `to_timestamp`
(`TimestampNTZType`) |
+| `to_unix_timestamp` | ✅ |
|
+| `to_utc_timestamp` | ⚠️ | Legacy zone forms (`GMT+1`, `PST`) throw a
native parse error |
+| `trunc` | ✅ |
|
+| `try_make_interval` | 🔜 | Produces legacy CalendarInterval; tracked by
#4540 |
+| `try_make_timestamp` | ⚠️ | Runs natively for valid inputs, but returns
wrong values for invalid inputs instead of NULL (#4554) |
+| `try_to_date` | 🔜 | Rewrites to `Cast`/`GetTimestamp` but
currently falls back; tracked by #4556 |
+| `try_to_time` | 🔜 | Spark 4.1 TIME type; tracked by #4288
|
+| `try_to_timestamp` | 🔜 | Rewrites to `Cast`/`GetTimestamp` but
currently falls back; tracked by #4556 |
+| `unix_date` | ✅ |
|
+| `unix_micros` | ✅ |
|
+| `unix_millis` | ✅ |
|
+| `unix_seconds` | ✅ |
|
+| `unix_timestamp` | ✅ |
|
+| `weekday` | ✅ |
|
+| `weekofyear` | ✅ |
|
+| `window` | 🔜 | Time-window grouping; tracked by #4553
|
+| `window_time` | 🔜 | Time-window grouping; tracked by #4553
|
+| `year` | ✅ |
|
+
+---
+
+## generator_funcs
+
+`explode` and `posexplode` are supported via `CometExplodeExec` (operator-level, not
+expression-level). The `outer` variants are wired but marked `Incompatible`; they require
+`spark.comet.exec.explode.enabled=true` and `allowIncompatible`.
+
+| Function           | Status | Notes                                                |
+| ------------------ | ------ | ---------------------------------------------------- |
+| `explode`          | ✅     | via `CometExplodeExec`                               |
+| `explode_outer`    | ⚠️     | `outer=true` incompatible; needs `allowIncompatible` |
+| `inline`           | 🔜     | Operator-level generator (like `explode`)            |
+| `inline_outer`     | 🔜     | Operator-level generator (like `explode`)            |
+| `posexplode`       | ✅     | via `CometExplodeExec`                               |
+| `posexplode_outer` | ⚠️     | `outer=true` incompatible; needs `allowIncompatible` |
+| `stack`            | 🔜     | Operator-level generator                             |
+
+---
+
+## hash_funcs
+
+| Function | Status | Notes |
+| ---------- | ------ | ----- |
+| `crc32` | ✅ | |
+| `hash` | ✅ | |
+| `md5` | ✅ | |
+| `sha` | ✅ | |
+| `sha1` | ✅ | |
+| `sha2` | ✅ | |
+| `xxhash64` | ✅ | |
+
+---
+
+## json_funcs
+
+| Function | Status | Notes
|
+| ------------------- | ------ |
---------------------------------------------------------------------------------------------------------------------
|
+| `from_json` | ⚠️ | Partial native support (requires explicit
schema, marked `Incompatible`); fuller support via codegen dispatch (#4305) |
+| `get_json_object` | ⚠️ | Single-quoted JSON and unescaped control
chars require `allowIncompatible` |
+| `json_array_length` | 🔜 | tracking #4098
|
+| `json_object_keys` | 🔜 |
[#3161](https://github.com/apache/datafusion-comet/issues/3161)
|
+| `json_tuple` | 🔜 |
[#3160](https://github.com/apache/datafusion-comet/issues/3160)
|
+| `schema_of_json` | 🔜 |
[#3163](https://github.com/apache/datafusion-comet/issues/3163)
|
+| `to_json` | ⚠️ | Partial native support (options and map/array
inputs fall back); fuller support via codegen dispatch (#4305) |
+
+---
+
+## lambda_funcs
+
+All higher-order functions are planned via [#4224](https://github.com/apache/datafusion-comet/issues/4224).
+
+| Function | Status | Notes
|
+| ------------------ | ------ |
--------------------------------------------------------------- |
+| `aggregate` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `array_sort` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `exists` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `filter` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `forall` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `map_filter` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `map_zip_with` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `reduce` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `transform` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `transform_keys` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `transform_values` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+| `zip_with` | 🔜 |
[#4224](https://github.com/apache/datafusion-comet/issues/4224) |
+
+---
+
+## map_funcs
+
+| Function | Status | Notes
|
+| ------------------ | ------ |
------------------------------------------------------------ |
+| `element_at` | ⚠️ | Only `ArrayType` input; `MapType` input falls
back |
+| `map` | 🔜 | Constructs a map
|
+| `map_concat` | 🔜 | Concatenates maps
|
+| `map_contains_key` | ✅ |
|
+| `map_entries` | ✅ |
|
+| `map_from_arrays` | ✅ |
|
+| `map_from_entries` | ⚠️ | `BinaryType` key/value falls back unless
`allowIncompatible` |
+| `map_keys` | ✅ |
|
+| `map_values` | ✅ |
|
+| `str_to_map` | ✅ |
|
+| `try_element_at` | ✅ | Lowers to `element_at`; array input (MapType
falls back) |
+
+---
+
+## math_funcs
+
+| Function | Status | Notes
|
+| -------------- | ------ |
------------------------------------------------------------------------------------------------------------------
|
+| `%` | ⚠️ | `try_mod` form (`EvalMode.TRY`) falls back
([#4484](https://github.com/apache/datafusion-comet/issues/4484)) |
+| `*` | ⚠️ | Interval multiplication falls back
|
+| `+` | ✅ |
|
+| `-` | ✅ |
|
+| `/` | ✅ |
|
+| `abs` | ⚠️ | Interval types fall back; ANSI overflow for
integer min value |
+| `acos` | ✅ |
|
+| `acosh` | ✅ |
|
+| `asin` | ✅ |
|
+| `asinh` | ✅ |
|
+| `atan` | ✅ |
|
+| `atan2` | ✅ |
|
+| `atanh` | ✅ |
|
+| `bin` | ✅ |
|
+| `bround` | 🔜 | #4538
|
+| `cbrt` | ✅ |
|
+| `ceil` | ⚠️ | Two-arg `ceil(expr, scale)` form falls back
|
+| `ceiling` | ✅ |
|
+| `conv` | 🔜 | #4538
|
+| `cos` | ✅ |
|
+| `cosh` | ✅ |
|
+| `cot` | ✅ |
|
+| `csc` | ✅ |
|
+| `degrees` | ✅ |
|
+| `div` | ✅ |
|
+| `e` | ✅ | Folds to a literal (like `pi`)
|
+| `exp` | ✅ |
|
+| `expm1` | ✅ |
|
+| `factorial` | ✅ |
|
+| `floor` | ⚠️ | Two-arg `floor(expr, scale)` form falls back
|
+| `greatest` | ✅ |
|
+| `hex` | ✅ |
|
+| `hypot` | 🔜 | #4538
|
+| `least` | ✅ |
|
+| `ln` | ✅ |
|
+| `log` | ✅ |
|
+| `log10` | ✅ |
|
+| `log1p` | 🔜 | #4538
|
+| `log2` | ✅ |
|
+| `mod` | ✅ |
|
+| `negative` | ✅ |
|
+| `pi` | ✅ |
|
+| `pmod` | 🔜 | #4538
|
+| `positive` | ✅ |
|
+| `pow` | ✅ |
|
+| `power` | ✅ |
|
+| `radians` | ✅ |
|
+| `rand` | ✅ |
|
+| `randn` | ✅ |
|
+| `random` | ✅ | Alias for `rand` (Spark 4.0+); seed must be a
literal |
+| `randstr` | 🔜 | Random string (Spark 4.0+)
|
+| `rint` | ✅ |
|
+| `round` | ⚠️ | Float/Double inputs always fall back;
integer/decimal HALF_UP supported |
+| `sec` | ✅ |
|
+| `shiftleft` | ✅ |
|
+| `sign` | ✅ |
|
+| `signum` | ✅ |
|
+| `sin` | ✅ |
|
+| `sinh` | ✅ |
|
+| `sqrt` | ✅ |
|
+| `tan` | ✅ |
|
+| `tanh` | ✅ |
|
+| `try_add` | ⚠️ | Datetime/interval form falls back; numeric form
supported |
+| `try_divide` | ✅ |
|
+| `try_mod` | 🔜 | Lowers to `Remainder` with TRY eval mode, which
falls back (#4484) |
+| `try_multiply` | ✅ |
|
+| `try_subtract` | ✅ |
|
+| `unhex` | ✅ |
|
+| `uniform` | ✅ | Constant-folded; literal arguments only (Spark
4.0+) |
+| `width_bucket` | ⚠️ | Wired via shim, bypasses support-level framework
([#4485](https://github.com/apache/datafusion-comet/issues/4485)) |
+
+---
+
+## misc_funcs
+
+| Function | Status | Notes
|
+| ----------------------------- | ------ |
--------------------------------------------------------------------------------
|
+| `aes_decrypt` | 🔜 | Falls back; `StaticInvoke` not
allowlisted; planned via codegen dispatch (#4558) |
+| `aes_encrypt` | 🔜 | Falls back; planned via codegen
dispatch (#4558); nondeterministic IV by default |
+| `assert_true` | 🔜 | Lowers to `RaiseError`, which falls
back |
+| `current_catalog` | ✅ | Resolved to a literal by the
analyzer (`ReplaceCurrentLike`) |
+| `current_database` | ✅ | Resolved to a literal by the
analyzer (`ReplaceCurrentLike`) |
+| `current_schema` | ✅ | Alias of `current_database`;
resolved to a literal by the analyzer |
+| `current_user` | ✅ | Resolved to a literal by the
analyzer; same as `user` |
+| `equal_null` | ✅ | Lowers to `<=>` (`EqualNullSafe`)
|
+| `is_variant_null` | 🔜 | tracking #4098
|
+| `monotonically_increasing_id` | ✅ |
|
+| `parse_json` | 🔜 | tracking #4098
|
+| `raise_error` | 🔜 | Raises a runtime error
|
+| `rand` | ✅ | Seed must be a literal
|
+| `randn` | ✅ | Seed must be a literal
|
+| `schema_of_variant` | 🔜 | tracking #4098
|
+| `schema_of_variant_agg` | 🔜 | tracking #4098
|
+| `session_user` | ✅ | Alias of `current_user`; resolved to
a literal by the analyzer |
+| `spark_partition_id` | ✅ |
|
+| `to_variant_object` | 🔜 | tracking #4098
|
+| `try_aes_decrypt` | 🔜 | Falls back; planned via codegen
dispatch (#4558) |
+| `try_parse_json` | 🔜 | tracking #4098
|
+| `try_variant_get` | 🔜 | tracking #4098
|
+| `typeof` | ✅ | Foldable; resolved to a literal
before Comet sees the plan |
+| `user` | ✅ | Resolved to a literal by the Spark
analyzer before reaching Comet |
+| `variant_get` | 🔜 | tracking #4098
|
+
+---
+
+## predicate_funcs
+
+| Function | Status | Notes
|
+| ------------- | ------ |
---------------------------------------------------------------------------------------------
|
+| `!` | ✅ |
|
+| `<` | ✅ |
|
+| `<=` | ✅ |
|
+| `<=>` | ✅ |
|
+| `=` | ✅ |
|
+| `==` | ✅ |
|
+| `>` | ✅ |
|
+| `>=` | ✅ |
|
+| `and` | ✅ |
|
+| `between` | ✅ |
|
+| `ilike` | ✅ |
|
+| `in` | ✅ |
|
+| `isnan` | ✅ |
|
+| `isnotnull` | ✅ |
|
+| `isnull` | ✅ |
|
+| `like` | ✅ |
|
+| `not` | ✅ |
|
+| `or` | ✅ |
|
+| `regexp` | ⚠️ | Alias for `rlike`; uses Rust `regex` crate,
requires `allowIncompatible` |
+| `regexp_like` | ⚠️ | Alias for `rlike`; uses Rust `regex` crate,
requires `allowIncompatible` |
+| `rlike` | ⚠️ | Uses Rust `regex` crate; requires
`allowIncompatible`; results may differ from Java `Pattern` |
+
+---
+
+## string_funcs
+
+| Function | Status | Notes
|
+| -------------------- | ------ |
--------------------------------------------------------------------------------
|
+| `ascii` | ✅ |
|
+| `base64` | 🔜 | Lowers to `StaticInvoke(encode)` (not
allowlisted); falls back |
+| `bit_length` | ✅ |
|
+| `btrim` | ✅ |
|
+| `char` | ✅ |
|
+| `char_length` | ✅ |
|
+| `character_length` | ✅ |
|
+| `chr` | ✅ |
|
+| `collate` | 🔜 | Spark collation (umbrella #2190)
|
+| `collation` | ✅ | Constant-folded to a literal (Spark 4.0+)
|
+| `concat_ws` | ✅ |
|
+| `contains` | ✅ |
|
+| `decode` | ✅ |
|
+| `elt` | 🔜 | #4538
|
+| `encode` | 🔜 | Lowers to `StaticInvoke(encode)` (not
allowlisted); falls back |
+| `endswith` | ✅ |
|
+| `find_in_set` | 🔜 | #4538
|
+| `format_number` | 🔜 | #4538
|
+| `format_string` | 🔜 | #4538
|
+| `initcap` | ✅ |
|
+| `instr` | ✅ |
|
+| `lcase` | ✅ |
|
+| `left` | ✅ |
|
+| `len` | ✅ |
|
+| `length` | ✅ |
|
+| `levenshtein` | 🔜 | #4538
|
+| `locate` | 🔜 | #4538
|
+| `lower` | ✅ |
|
+| `lpad` | ✅ |
|
+| `ltrim` | ✅ |
|
+| `luhn_check` | ✅ | Native via `StaticInvoke` (tests:
luhn_check.sql) |
+| `mask` | 🔜 | Data masking
|
+| `octet_length` | ✅ |
|
+| `overlay` | 🔜 | #4538
|
+| `position` | 🔜 | #4538
|
+| `printf` | 🔜 | #4538
|
+| `regexp_count` | 🔜 | tracking #4098
|
+| `regexp_extract` | 🔜 | tracking #4098
|
+| `regexp_extract_all` | 🔜 | tracking #4098
|
+| `regexp_instr` | 🔜 | tracking #4098
|
+| `regexp_replace` | ✅ |
|
+| `regexp_substr` | 🔜 | tracking #4098
|
+| `repeat` | ✅ |
|
+| `replace` | ✅ |
|
+| `right` | ✅ |
|
+| `rpad` | ✅ |
|
+| `rtrim` | ✅ |
|
+| `soundex` | 🔜 | #4538
|
+| `space` | ✅ |
|
+| `split` | ✅ |
|
+| `split_part` | 🔜 | Lowers to `element_at(StringSplitSQL(...))`;
`StringSplitSQL` falls back (#4561) |
+| `startswith` | ✅ |
|
+| `substr` | ✅ |
|
+| `substring` | ✅ |
|
+| `substring_index` | ✅ |
|
+| `to_binary` | ⚠️ | Only the hex format is accelerated (lowers
to `unhex`); UTF-8/base64 fall back |
+| `to_char` | 🔜 | #4538
|
+| `to_number` | 🔜 | #4538
|
+| `to_varchar` | 🔜 | #4538
|
+| `translate` | ✅ |
|
+| `trim` | ✅ |
|
+| `try_to_binary` | 🔜 | Lowers to `TryEval(...)`, which falls back
|
+| `try_to_number` | 🔜 | TRY variant of `to_number`
|
+| `ucase` | ✅ |
|
+| `unbase64` | 🔜 | #4538
|
+| `upper` | ✅ |
|
+
+---
+
+## struct_funcs
+
+| Function | Status | Notes |
+| -------------- | ------ | ---------------------------------------- |
+| `named_struct` | ⚠️ | Duplicate field names fall back to Spark |
+| `struct` | ✅ | |
+
+---
+
+## url_funcs
+
+| Function | Status | Notes |
+| ---------------- | ------ | ----- |
+| `parse_url` | ✅ | |
+| `try_url_decode` | ✅ | |
+| `url_decode` | ✅ | |
+| `url_encode` | ✅ | |
+
+---
+
+## window_funcs
+
+Window functions run via `CometWindowExec`. Window support is disabled by default due to known
+correctness issues (tracking [#2721](https://github.com/apache/datafusion-comet/issues/2721)).
+When enabled, `lag` and `lead` are explicitly wired; aggregate window functions (`count`, `min`,
+`max`, `sum`) are also supported. Ranking functions (`rank`, `dense_rank`, `row_number`,
+`ntile`, `percent_rank`, `cume_dist`, `nth_value`) are not yet wired in the window serde and
+fall back to Spark.
+
+| Function | Status | Notes |
+| -------------- | ------ | --------------------------------- |
+| `cume_dist` | 🔜 | Window function; tracked by #2721 |
+| `dense_rank` | 🔜 | Window function; tracked by #2721 |
+| `lag` | ✅ | via `CometWindowExec` |
+| `lead` | ✅ | via `CometWindowExec` |
+| `nth_value` | 🔜 | Window function; tracked by #2721 |
+| `ntile` | 🔜 | Window function; tracked by #2721 |
+| `percent_rank` | 🔜 | Window function; tracked by #2721 |
+| `rank` | 🔜 | Window function; tracked by #2721 |
+| `row_number` | 🔜 | Window function; tracked by #2721 |
+
+---
+
+## Out-of-scope function list
Review Comment:
Isn't this a duplicate of the earlier "Out of scope" section?
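
For readers of this hunk, the per-expression settings described in the new introduction follow a fixed key pattern (`spark.comet.expression.EXPRNAME.enabled` and `spark.comet.expression.EXPRNAME.allowIncompatible`). The sketch below only assembles those keys as strings; the helper name `comet_expression_conf` is hypothetical and not part of Comet, and the expression names `Cast` and `Length` are the examples the page itself uses. It assumes the settings are passed with Spark's usual `--conf key=value` form.

```python
def comet_expression_conf(expr_name, enabled=None, allow_incompatible=None):
    """Build per-expression Comet settings for one Spark expression class name.

    Hypothetical helper for illustration only. The key patterns are taken
    from the documentation text in this diff.
    """
    prefix = f"spark.comet.expression.{expr_name}"
    conf = {}
    if enabled is not None:
        # Spark configuration values are strings; booleans become "true"/"false".
        conf[f"{prefix}.enabled"] = str(enabled).lower()
    if allow_incompatible is not None:
        conf[f"{prefix}.allowIncompatible"] = str(allow_incompatible).lower()
    return conf


settings = {}
# Opt in to the known-incompatible cases of Cast (there is no global opt-in).
settings.update(comet_expression_conf("Cast", allow_incompatible=True))
# Disable Comet's handling of Length so that it falls back to Spark.
settings.update(comet_expression_conf("Length", enabled=False))

for key, value in sorted(settings.items()):
    print(f"--conf {key}={value}")
```

Because the opt-in is per expression, each expression needs its own key; the helper makes that explicit by taking one expression name per call.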
