[PR] Feat/optimize strings 2973 [datafusion-comet]

via GitHub Sat, 27 Dec 2025 09:27:19 -0800


Mazen-Ghanaym opened a new pull request, #3000:
URL: https://github.com/apache/datafusion-comet/pull/3000


   ## Which issue does this PR close?
   
   Closes #2973.
   
   ## Rationale for this change
   
   The `startsWith` and `endsWith` string functions were previously delegated 
to DataFusion's built-in scalar functions, which introduced unnecessary 
overhead and did not fully leverage Comet's native execution capabilities. This 
PR implements optimized native expressions to improve performance.
   
   ## What changes are included in this PR?
   
   This PR introduces custom `StartsWithExpr` and `EndsWithExpr` physical 
expressions (both defined in 
`native/spark-expr/src/string_funcs/starts_ends_with.rs`) with the following 
optimizations:
   
   **`startsWith`:**
   - Uses Arrow's `compute::starts_with` kernel with a **pre-allocated pattern 
array** to avoid per-batch allocations.
   - Achieves **1.1X speedup** over Spark.
   
   **`endsWith`:**
   - Uses **direct buffer access** to the underlying `StringArray` data, 
bypassing iterator overhead.
   - Manually calculates suffix offsets and performs raw byte slice comparison 
(`memcmp`).
   - Achieves **1.0X parity** with Spark (up from a 0.9X regression).
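
The `endsWith` approach above can be illustrated with a minimal sketch. It is not the code in this PR: the function name `ends_with_rows` is hypothetical, plain slices stand in for the offsets and value buffers of Arrow's `StringArray` so that the example compiles without the `arrow` crate, and null handling is omitted.

```rust
/// Evaluate `ends_with(row, suffix)` for every row of an Arrow-style string column.
///
/// `offsets` holds `rows + 1` entries; row `i` occupies
/// `values[offsets[i]..offsets[i + 1]]`. This mirrors the layout of a
/// `StringArray`, but the slices here are stand-ins (an assumption of this sketch).
fn ends_with_rows(offsets: &[i32], values: &[u8], suffix: &[u8]) -> Vec<bool> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            // A row shorter than the suffix can never match, so skip the comparison.
            end - start >= suffix.len()
                // Compare only the trailing bytes of the row. Byte-slice equality
                // is typically lowered to a memcmp-style comparison.
                && (&values[end - suffix.len()..end] == suffix)
        })
        .collect()
}
```

For the rows `"foobar"`, `"baz"`, `""`, and `"ar"` stored as `values = b"foobarbazar"` with `offsets = [0, 6, 9, 9, 11]`, calling `ends_with_rows(&offsets, values, b"bar")` matches only the first row. The length check runs before the slice is taken, so short rows never index out of bounds.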
   
   **Files Changed:**
   - `native/spark-expr/src/string_funcs/starts_ends_with.rs` (NEW)
   - `native/spark-expr/src/string_funcs/mod.rs`
   - `native/core/src/execution/expressions/strings.rs`
   - `native/core/src/execution/planner/expression_registry.rs`
   - `spark/src/main/scala/org/apache/comet/serde/strings.scala`
   - `spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala`
   - 
`spark/src/test/scala/org/apache/spark/sql/benchmark/CometStringExpressionBenchmark.scala`
   
   ## How are these changes tested?
   
   1. **Existing Tests:** The implementation passes all existing Comet tests, 
including TPC-DS and TPC-H correctness suites which exercise string functions.
   2. **Benchmark Verification:** Performance was verified using 
`CometStringExpressionBenchmark`:
      - `startsWith`: **1.1X faster** than Spark (Comet 1887ms vs Spark 2028ms)
      - `endsWith`: **1.0X parity** with Spark (Comet 3389ms vs Spark 3354ms)
   3. **CI Verification:** A temporary workflow was used to verify that the 
benchmark executes correctly in the GitHub Actions CI environment. The results 
are listed under **Benchmark Results**.
   
   ### Benchmark Results
   
   **Environment:** OpenJDK 64-Bit Server VM 11.0.29+7-LTS on Linux 
6.11.0-1018-azure  
   **Processor:** AMD EPYC 7763 64-Core Processor
   
   #### startsWith
   
   | Case | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
   |------|----------------|---------------|------------|------------|--------------|----------|
   | Spark | 1657 | 1669 | 17 | 0.6 | 1580.2 | 1.0X |
   | Comet (Scan) | 1740 | 1755 | 20 | 0.6 | 1659.6 | 1.0X |
   | Comet (Scan + Exec) | 1546 | 1546 | 1 | 0.7 | 1474.1 | **1.1X** |
   
   #### endsWith
   
   | Case | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
   |------|----------------|---------------|------------|------------|--------------|----------|
   | Spark | 1625 | 1632 | 10 | 0.6 | 1549.9 | 1.0X |
   | Comet (Scan) | 1731 | 1732 | 1 | 0.6 | 1651.2 | 0.9X |
   | Comet (Scan + Exec) | 1562 | 1563 | 0 | 0.7 | 1490.0 | **1.0X** |
   
   **Summary:**
   - `startsWith`: **1.1X faster** than Spark (1546ms vs 1657ms)
   - `endsWith`: **1.0X parity** with Spark (1562ms vs 1625ms)
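
The `Relative` column can be reproduced from the best times. The sketch below assumes that `Relative` is the baseline (Spark) best time divided by the case's best time, rounded to one decimal place; that assumption matches every row in both tables. The helper name `relative` is hypothetical.

```rust
/// Format a benchmark case's speed relative to the baseline,
/// assuming Relative = baseline best time / case best time.
fn relative(baseline_best_ms: f64, case_best_ms: f64) -> String {
    format!("{:.1}X", baseline_best_ms / case_best_ms)
}
```

For example, `relative(1657.0, 1546.0)` gives `1.1X` for `startsWith`, and `relative(1625.0, 1562.0)` gives `1.0X` for `endsWith`. The underlying ratio for `endsWith` is about 1.04, which the one-decimal format reports as parity.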
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

