This is an automated email from the ASF dual-hosted git repository.
beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f350ddba682 [SPARK-46403][SQL] Decode parquet binary with getBytesUnsafe method
f350ddba682 is described below
commit f350ddba68264f05c1029ffb5cbebbeead7e275f
Author: Kun Wan <[email protected]>
AuthorDate: Sat Dec 16 10:28:51 2023 +0800
[SPARK-46403][SQL] Decode parquet binary with getBytesUnsafe method
### What changes were proposed in this pull request?
Currently, Spark decodes a Parquet binary object with the getBytes() method.
The **Binary.getBytes()** method always makes a new copy of the internal bytes.
We can instead use the **Binary.getBytesUnsafe()** method in
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L55-L62
to reuse the cached bytes when getBytes() has already been called and the
copy is cached.
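The copy-vs-cache contract described above can be sketched with a minimal, hypothetical Binary-like class. This is an illustration of the pattern only, not parquet-mr's actual `Binary` implementation; the class and field names here are made up for the example:

```java
import java.util.Arrays;

// Hypothetical sketch of the getBytes()/getBytesUnsafe() contract.
// Not parquet-mr's actual Binary class.
public class BinarySketch {
  private final byte[] internal;
  private byte[] cachedCopy; // populated on the first getBytes() call

  public BinarySketch(byte[] internal) {
    this.internal = internal;
  }

  // Always safe: hands out a defensive copy, caching it for reuse.
  public byte[] getBytes() {
    if (cachedCopy == null) {
      cachedCopy = Arrays.copyOf(internal, internal.length);
    }
    return cachedCopy;
  }

  // Fast path: returns the cached copy if one exists, otherwise the
  // backing array itself -- callers must not mutate the result.
  public byte[] getBytesUnsafe() {
    return cachedCopy != null ? cachedCopy : internal;
  }

  public static void main(String[] args) {
    byte[] backing = {1, 2, 3};
    BinarySketch b = new BinarySketch(backing);
    // Before any getBytes() call, getBytesUnsafe() avoids a copy entirely.
    System.out.println(b.getBytesUnsafe() == backing);  // true
    // After getBytes(), getBytesUnsafe() returns the cached copy.
    byte[] copy = b.getBytes();
    System.out.println(b.getBytesUnsafe() == copy);     // true
    System.out.println(copy == backing);                // false
  }
}
```

Under this sketch, call sites that never mutate the returned array (as in the two diffs below) can skip the per-value allocation entirely.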
Local benchmark, before this PR:
```
OpenJDK 64-Bit Server VM 17.0.6+10-LTS on Mac OS X 13.6
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Parquet dictionary:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Read binary dictionary            18919          19449         393         5.5         180.4       1.0X
```
after this PR:
```
OpenJDK 64-Bit Server VM 17.0.6+10-LTS on Mac OS X 13.6
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Parquet dictionary:       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Read binary dictionary            10135          10602         337        10.3          96.7       1.0X
```
### Why are the changes needed?
Optimize parquet reader.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Local test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #44351 from wankunde/binary.
Authored-by: Kun Wan <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
---
.../spark/sql/execution/datasources/parquet/ParquetDictionary.java | 2 +-
.../sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
index 6626f3fee98..ffdb94ee376 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetDictionary.java
@@ -70,7 +70,7 @@ public final class ParquetDictionary implements Dictionary {
long signed = dictionary.decodeToLong(id);
return new BigInteger(Long.toUnsignedString(signed)).toByteArray();
} else {
- return dictionary.decodeToBinary(id).getBytes();
+ return dictionary.decodeToBinary(id).getBytesUnsafe();
}
}
}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
index 1ed6c4329eb..918f21716f4 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java
@@ -969,7 +969,7 @@ public class ParquetVectorUpdaterFactory {
int offset,
WritableColumnVector values,
VectorizedValuesReader valuesReader) {
-    values.putByteArray(offset, valuesReader.readBinary(arrayLen).getBytes());
+    values.putByteArray(offset, valuesReader.readBinary(arrayLen).getBytesUnsafe());
}
@Override
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]