Mazen-Ghanaym opened a new pull request, #3000:
URL: https://github.com/apache/datafusion-comet/pull/3000
## Which issue does this PR close?
Closes #2973.
## Rationale for this change
The `startsWith` and `endsWith` string functions were previously delegated
to DataFusion's built-in scalar functions, which introduced unnecessary
overhead and did not fully leverage Comet's native execution capabilities. This
PR implements optimized native expressions to improve performance.
## What changes are included in this PR?
This PR introduces custom `StartsWithExpr` and `EndsWithExpr` physical
expressions, both defined in
`native/spark-expr/src/string_funcs/starts_ends_with.rs`, with the following
optimizations:
**`startsWith`:**
- Uses Arrow's `compute::starts_with` kernel with a **pre-allocated pattern
array** to avoid per-batch allocations.
- Achieves **1.1X speedup** over Spark.
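
A minimal sketch of the "allocate the pattern once, reuse it for every batch" idea behind the `startsWith` change. It uses only the Rust standard library and does not call Arrow's kernel; the type `StartsWithSketch` and its methods are hypothetical names chosen for illustration and are not the code in this PR.

```rust
/// Hypothetical stand-in for a physical expression; not Comet's actual type.
struct StartsWithSketch {
    // Built once when the expression is created, then shared by every batch.
    pattern: Box<[u8]>,
}

impl StartsWithSketch {
    fn new(pattern: &str) -> Self {
        Self { pattern: pattern.as_bytes().into() }
    }

    /// Evaluates one batch without allocating a new pattern.
    /// `None` models a SQL NULL input and yields a NULL result.
    fn evaluate(&self, batch: &[Option<&str>]) -> Vec<Option<bool>> {
        batch
            .iter()
            .map(|value| value.map(|s| s.as_bytes().starts_with(&self.pattern)))
            .collect()
    }
}
```

Calling `evaluate` repeatedly on the same `StartsWithSketch` only allocates the output vector; the pattern bytes are created in `new` and never rebuilt.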
**`endsWith`:**
- Uses **direct buffer access** to the underlying `StringArray` data,
bypassing iterator overhead.
- Manually calculates suffix offsets and performs raw byte slice comparison
(`memcmp`).
- Achieves **1.0X parity** with Spark (improved from 0.9X regression).
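
The suffix comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `Utf8Column` is a hypothetical stand-in that mimics the offsets-plus-values layout of an Arrow string array, null handling is omitted, and only the Rust standard library is used.

```rust
/// Hypothetical stand-in for the Arrow string layout: one contiguous value
/// buffer plus `len + 1` offsets. Not Comet's or Arrow's actual type.
struct Utf8Column {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl Utf8Column {
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { offsets, values }
    }
}

/// Checks every value against `suffix` by slicing the shared value buffer.
fn ends_with(col: &Utf8Column, suffix: &[u8]) -> Vec<bool> {
    col.offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            // A value shorter than the suffix can never match; otherwise
            // compare the last `suffix.len()` bytes as a raw byte slice.
            end - start >= suffix.len() && col.values[end - suffix.len()..end] == *suffix
        })
        .collect()
}
```

Because each suffix is located from two adjacent offsets, no per-row string is materialized before the byte comparison.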
**Files Changed:**
- `native/spark-expr/src/string_funcs/starts_ends_with.rs` (NEW)
- `native/spark-expr/src/string_funcs/mod.rs`
- `native/core/src/execution/expressions/strings.rs`
- `native/core/src/execution/planner/expression_registry.rs`
- `spark/src/main/scala/org/apache/comet/serde/strings.scala`
- `spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala`
- `spark/src/test/scala/org/apache/spark/sql/benchmark/CometStringExpressionBenchmark.scala`
## How are these changes tested?
1. **Existing Tests:** The implementation passes all existing Comet tests,
including TPC-DS and TPC-H correctness suites which exercise string functions.
2. **Benchmark Verification:** Performance was verified using
`CometStringExpressionBenchmark`:
- `startsWith`: **1.1X faster** than Spark (Comet 1887ms vs Spark 2028ms)
- `endsWith`: **1.0X parity** with Spark (Comet 3389ms vs Spark 3354ms)
3. **CI Verification:** A temporary workflow was used to verify that the
benchmark executes correctly in the GitHub Actions CI environment; the results
are in the **Benchmark Results** section below.
### Benchmark Results
**Environment:** OpenJDK 64-Bit Server VM 11.0.29+7-LTS on Linux
6.11.0-1018-azure
**Processor:** AMD EPYC 7763 64-Core Processor
#### startsWith
| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|------|---------------|--------------|-----------|-----------|-------------|----------|
| Spark | 1657 | 1669 | 17 | 0.6 | 1580.2 | 1.0X |
| Comet (Scan) | 1740 | 1755 | 20 | 0.6 | 1659.6 | 1.0X |
| Comet (Scan + Exec) | 1546 | 1546 | 1 | 0.7 | 1474.1 | **1.1X** |
#### endsWith
| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|------|---------------|--------------|-----------|-----------|-------------|----------|
| Spark | 1625 | 1632 | 10 | 0.6 | 1549.9 | 1.0X |
| Comet (Scan) | 1731 | 1732 | 1 | 0.6 | 1651.2 | 0.9X |
| Comet (Scan + Exec) | 1562 | 1563 | 0 | 0.7 | 1490.0 | **1.0X** |
**Summary:**
- `startsWith`: **1.1X faster** than Spark (1546ms vs 1657ms)
- `endsWith`: **1.0X parity** with Spark (1562ms vs 1625ms)
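
The `Relative` column in the CI tables is consistent with dividing the Spark best time by each case's best time and rounding to one decimal place. The sketch below assumes that formula; the function name `relative` is hypothetical.

```rust
/// Formats the speedup of a case relative to the baseline, assuming
/// Relative = baseline best time / case best time, to one decimal place.
fn relative(baseline_best_ms: f64, case_best_ms: f64) -> String {
    format!("{:.1}X", baseline_best_ms / case_best_ms)
}
```

With the CI figures, `relative(1657.0, 1546.0)` gives `1.1X` for `startsWith` (ratio about 1.07) and `relative(1625.0, 1562.0)` gives `1.0X` for `endsWith` (ratio about 1.04), so the one-decimal rounding hides a small `endsWith` gain.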