[PR] [SPARK-48615][SQL] Perf improvement for parsing hex string [spark]

via GitHub Thu, 13 Jun 2024 03:20:48 -0700


yaooqinn opened a new pull request, #46972:
URL: https://github.com/apache/spark/pull/46972

### What changes were proposed in this pull request?

Currently, we use two hexadecimal string parsing functions. One uses Apache
Commons Codec's `Hex` for X-prefixed literal parsing, and the other uses a
builtin implementation for the `unhex` function. I benchmarked both against
`java.util.HexFormat`, which was introduced in JDK 17.

```
OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
Apple M2 Max
Cardinality 1000000:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Apache                                              5050           5100          86         0.2        5050.1       1.0X
Spark                                               3822           3840          30         0.3        3821.6       1.3X
Java                                                2462           2522          87         0.4        2462.1       2.1X

OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
Apple M2 Max
Cardinality 2000000:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Apache                                             10020          10828        1154         0.2        5010.1       1.0X
Spark                                               6875           6966         144         0.3        3437.7       1.5X
Java                                                4999           5092          89         0.4        2499.3       2.0X

OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
Apple M2 Max
Cardinality 4000000:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Apache                                             20090          20433         433         0.2        5022.5       1.0X
Spark                                              13389          13620         229         0.3        3347.2       1.5X
Java                                               10023          10069          42         0.4        2505.6       2.0X

OpenJDK 64-Bit Server VM 17.0.10+0 on Mac OS X 14.5
Apple M2 Max
Cardinality 8000000:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Apache                                             40277          43453        2755         0.2        5034.7       1.0X
Spark                                              27145          27380         311         0.3        3393.1       1.5X
Java                                               19980          21198        1473         0.4        2497.5       2.0X
```

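The results indicate that the speed ranks Apache Commons Codec < builtin < Java:
the builtin implementation is roughly 1.3x to 1.5x as fast as Apache Commons
Codec, and `java.util.HexFormat` is roughly 2x as fast.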

In this PR, we replace these two with the Java 17 API, `java.util.HexFormat`.
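
For readers unfamiliar with the JDK 17 API, here is a minimal sketch of parsing a hex string with `java.util.HexFormat`. It is not the code from this PR: the class name `HexParseSketch` and the helper `parseHex` are hypothetical, and the sample input is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical illustration only; not Spark's implementation.
class HexParseSketch {
    // HexFormat instances are immutable, so one can be shared.
    private static final HexFormat HEX = HexFormat.of();

    // Decodes a hex string (upper or lower case digits) into bytes.
    static byte[] parseHex(String hex) {
        return HEX.parseHex(hex);
    }

    public static void main(String[] args) {
        byte[] bytes = parseHex("537061726B");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "Spark"
        System.out.println(HEX.formatHex(bytes)); // prints "537061726b"

        // HexFormat.parseHex rejects odd-length input, so a caller that must
        // accept such input has to handle it before delegating.
        try {
            parseHex("ABC");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected odd-length input");
        }
    }
}
```

Note that `parseHex` throws `IllegalArgumentException` for odd-length strings and for non-hex characters, so any caller with more lenient semantics would need its own handling around the call.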

### Why are the changes needed?

Performance improvement.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Benchmarking (results above).

Existing unit tests in
`org.apache.spark.sql.catalyst.expressions.MathExpressionsSuite`.

### Was this patch authored or co-authored using generative AI tooling?

no
