(spark) branch master updated: [SPARK-48432][SQL] Avoid unboxing integers in UnivocityParser

gurwls223 Mon, 27 May 2024 16:55:56 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new cc3bf36c9f22 [SPARK-48432][SQL] Avoid unboxing integers in UnivocityParser
cc3bf36c9f22 is described below

commit cc3bf36c9f22d54606f858f0f90008cff792c59d
Author: Vladimir Golubev <[email protected]>
AuthorDate: Tue May 28 08:55:39 2024 +0900

    [SPARK-48432][SQL] Avoid unboxing integers in UnivocityParser
    
    ### What changes were proposed in this pull request?
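    `tokenIndexArr` is created as an array of `java.lang.Integer`s. However, it is used not only for the wrapped Java parser, but also during parsing to identify the correct token index. This change stores it as a primitive `Int` array and boxes the values only where they are passed to the Univocity parser settings.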
    
    ### Why are the changes needed?
    This noticeably improves CSV parsing performance by avoiding an `Integer` unboxing on each token index lookup during parsing.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    `testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #46759 from vladimirg-db/vladimirg-db/avoid-unboxing-in-univocity-parser.
    
    Authored-by: Vladimir Golubev <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
---
 .../scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala   | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index 4d95097e1681..61c2f7a5926b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -63,8 +63,7 @@ class UnivocityParser(
   private type ValueConverter = String => Any
 
   // This index is used to reorder parsed tokens
-  private val tokenIndexArr =
-    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
+  private val tokenIndexArr = requiredSchema.map(f => dataSchema.indexOf(f)).toArray
 
   // True if we should inform the Univocity CSV parser to select which fields to read by their
   // positions. Generally assigned by input configuration options, except when input column(s) have
@@ -81,7 +80,8 @@ class UnivocityParser(
     // When to-be-parsed schema is shorter than the to-be-read data schema, we let Univocity CSV
     // parser select a sequence of fields for reading by their positions.
     if (parsedSchema.length < dataSchema.length) {
-      parserSetting.selectIndexes(tokenIndexArr: _*)
+      // Box into Integer here to avoid unboxing where `tokenIndexArr` is used during parsing
+      parserSetting.selectIndexes(tokenIndexArr.map(java.lang.Integer.valueOf(_)): _*)
     }
     new CsvParser(parserSetting)
   }
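
The pattern in the diff above can be sketched in isolation. This is a minimal, hypothetical illustration and not the actual `UnivocityParser` code: the schemas are plain column-name sequences, `selectIndexes` is a local stand-in for a Java varargs method that takes boxed `Integer` values, and `reorder` is an assumed example of a per-row loop that reads `tokenIndexArr`.

```scala
object TokenIndexSketch {
  // Hypothetical stand-in for a Java varargs API that accepts boxed Integer values.
  def selectIndexes(indexes: java.lang.Integer*): Seq[Int] = indexes.map(_.intValue)

  // Hypothetical schemas, reduced to column names for illustration.
  val dataSchema: Seq[String] = Seq("id", "name", "age", "city")
  val requiredSchema: Seq[String] = Seq("city", "id")

  // Array[Int] is a primitive int[] on the JVM, so reading an element does not unbox.
  val tokenIndexArr: Array[Int] = requiredSchema.map(f => dataSchema.indexOf(f)).toArray

  // Assumed per-row hot path: reorder parsed tokens using the primitive indexes.
  def reorder(tokens: Array[String]): Array[String] = {
    val out = new Array[String](tokenIndexArr.length)
    var i = 0
    while (i < tokenIndexArr.length) {
      out(i) = tokens(tokenIndexArr(i))
      i += 1
    }
    out
  }

  // Box once, at the single call site that needs Integer objects.
  val selected: Seq[Int] =
    selectIndexes(tokenIndexArr.map(java.lang.Integer.valueOf(_)): _*)
}
```

The design choice is to pay the boxing cost once, when the parser is configured, rather than an unboxing cost on every index lookup while rows are parsed.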


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
