This is an automated email from the ASF dual-hosted git repository.
beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f350ddba682 [SPARK-46403][SQL] Decode parquet binary with getBytesUnsafe method
f350ddba682 is described below
commit f350ddba68264f05c1029ffb5cbebbeead7e275f
Author: Kun Wan <[email protected]>
AuthorDate: Sat Dec 16 10:28:51 2023 +0800
[SPARK-46403][SQL] Decode parquet binary with getBytesUnsafe method
### What changes were proposed in this pull request?
Currently, Spark decodes a Parquet binary object with the getBytes() method.
The **Binary.getBytes()** method always makes a new copy of the internal bytes.
We can instead use the **Binary.getBytesUnsafe()** method in
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L55-L62
to reuse the cached bytes when getBytes() has already been called and the
copy is cached.
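The copy-vs-cache contract described above can be sketched with a minimal, hypothetical Binary-like class. This is an illustration of the pattern only, not parquet-mr's actual `Binary` implementation; the class and field names here are made up for the example:

```java
import java.util.Arrays;

// Hypothetical sketch of the getBytes()/getBytesUnsafe() contract.
// Not parquet-mr's actual Binary class.
public class BinarySketch {
  private final byte[] internal;
  private byte[] cachedCopy; // populated on the first getBytes() call

  public BinarySketch(byte[] internal) {
    this.internal = internal;
  }

  // Always safe: hands out a defensive copy, caching it for reuse.
  public byte[] getBytes() {
    if (cachedCopy == null) {
      cachedCopy = Arrays.copyOf(internal, internal.length);
    }
    return cachedCopy;
  }

  // Fast path: returns the cached copy if one exists, otherwise the
  // backing array itself -- callers must not mutate the result.
  public byte[] getBytesUnsafe() {
    return cachedCopy != null ? cachedCopy : internal;
  }

  public static void main(String[] args) {
    byte[] backing = {1, 2, 3};
    BinarySketch b = new BinarySketch(backing);
    // Before any getBytes() call, getBytesUnsafe() avoids a copy entirely.
    System.out.println(b.getBytesUnsafe() == backing);  // true
    // After getBytes(), getBytesUnsafe() returns the cached copy.
    byte[] copy = b.getBytes();
    System.out.println(b.getBytesUnsafe() == copy);     // true
    System.out.println(copy == backing);                // false
  }
}
```

Under this sketch, call sites that never mutate the returned array (as in the two diffs below) can skip the per-value allocation entirely.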
Local benchmark, before this PR:
```
OpenJDK 64-Bit Server VM 17.0.6+10-LTS on Mac OS X 13.6
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Parquet dictionary:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Read binary dictionary            18919          19449         393         5.5         180.4       1.0X
```
after this PR:
```
OpenJDK 64-Bit Server VM 17.0.6+10-LTS on Mac OS X 13.6
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Parquet dictionary:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Read binary dictionary            10135          10602         337        10.3          96.7       1.0X
```
### Why are the changes needed?
Optimize parquet reader.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #44351 from wankunde/binary.
Authored-by: Kun Wan <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
---
.../spark/sql/execution/datasources/parquet/ParquetDictionary.java | 2 +-
.../sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
index 6626f3fee98..ffdb94ee376 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
@@ -70,7 +70,7 @@ public final class ParquetDictionary implements Dictionary {
long signed = dictionary.decodeToLong(id);
return new BigInteger(Long.toUnsignedString(signed)).toByteArray();
} else {
- return dictionary.decodeToBinary(id).getBytes();
+ return dictionary.decodeToBinary(id).getBytesUnsafe();
}
}
}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
index 1ed6c4329eb..918f21716f4 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
@@ -969,7 +969,7 @@ public class ParquetVectorUpdaterFactory {
int offset,
WritableColumnVector values,
VectorizedValuesReader valuesReader) {
-    values.putByteArray(offset, valuesReader.readBinary(arrayLen).getBytes());
+    values.putByteArray(offset, valuesReader.readBinary(arrayLen).getBytesUnsafe());
}
@Override
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]