This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new cc3bf36c9f22 [SPARK-48432][SQL] Avoid unboxing integers in UnivocityParser
cc3bf36c9f22 is described below
commit cc3bf36c9f22d54606f858f0f90008cff792c59d
Author: Vladimir Golubev <[email protected]>
AuthorDate: Tue May 28 08:55:39 2024 +0900
[SPARK-48432][SQL] Avoid unboxing integers in UnivocityParser
### What changes were proposed in this pull request?
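`tokenIndexArr` is created as an array of `java.lang.Integer`s. However, it
is used not only to configure the wrapped Java parser but also during parsing
to identify the correct token index, so each lookup there unboxes an `Integer`.
This change stores the indexes as primitive `Int`s and boxes them only when
they are passed to the Java parser.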
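The difference can be illustrated with a minimal, self-contained sketch. It is not the Spark code: `TokenIndexSketch`, the string-based schemas, `reorder`, and the local `selectIndexes` stand-in are all hypothetical names chosen for illustration, and the hot-path loop only assumes that parsing looks up one index per required field.

```scala
// Hypothetical sketch; plain strings stand in for Spark's schema fields.
object TokenIndexSketch {
  val dataSchema: Seq[String] = Seq("id", "name", "age", "city")
  val requiredSchema: Seq[String] = Seq("city", "id")

  // Before: an array of boxed java.lang.Integer values.
  // Reading an element as an Int in Scala code unboxes it every time.
  val boxedIndexArr: Array[java.lang.Integer] =
    requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray

  // After: Array[Int], which is a primitive int[] on the JVM.
  val tokenIndexArr: Array[Int] =
    requiredSchema.map(f => dataSchema.indexOf(f)).toArray

  // Assumed shape of the per-row hot path: pick tokens by primitive index.
  def reorder(tokens: Array[String]): Array[String] = {
    val out = new Array[String](tokenIndexArr.length)
    var i = 0
    while (i < tokenIndexArr.length) {
      out(i) = tokens(tokenIndexArr(i)) // no unboxing here
      i += 1
    }
    out
  }

  // Stand-in for a Java varargs API that expects Integer arguments.
  def selectIndexes(indexes: java.lang.Integer*): Seq[Int] =
    indexes.map(_.intValue)

  // Box once, at setup time, where the Java-style API needs it.
  val selected: Seq[Int] =
    selectIndexes(tokenIndexArr.map(java.lang.Integer.valueOf(_)): _*)
}
```

The boxing cost is paid once when the parser is configured, rather than on every token lookup for every row.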
### Why are the changes needed?
This noticeably improves CSV parsing performance by removing an unboxing on
every token index lookup.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`testOnly org.apache.spark.sql.catalyst.csv.UnivocityParserSuite`
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46759 from vladimirg-db/vladimirg-db/avoid-unboxing-in-univocity-parser.
Authored-by: Vladimir Golubev <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.../scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
index 4d95097e1681..61c2f7a5926b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -63,8 +63,7 @@ class UnivocityParser(
private type ValueConverter = String => Any
// This index is used to reorder parsed tokens
- private val tokenIndexArr =
- requiredSchema.map(f => java.lang.Integer.valueOf(dataSchema.indexOf(f))).toArray
+ private val tokenIndexArr = requiredSchema.map(f => dataSchema.indexOf(f)).toArray
// True if we should inform the Univocity CSV parser to select which fields to read by their
// positions. Generally assigned by input configuration options, except when input column(s) have
@@ -81,7 +80,8 @@ class UnivocityParser(
// When to-be-parsed schema is shorter than the to-be-read data schema, we let Univocity CSV
// parser select a sequence of fields for reading by their positions.
if (parsedSchema.length < dataSchema.length) {
- parserSetting.selectIndexes(tokenIndexArr: _*)
+ // Box into Integer here to avoid unboxing where `tokenIndexArr` is used during parsing
+ parserSetting.selectIndexes(tokenIndexArr.map(java.lang.Integer.valueOf(_)): _*)
}
new CsvParser(parserSetting)
}