[ https://issues.apache.org/jira/browse/SPARK-51579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ángel Álvarez Pascua updated SPARK-51579:
-----------------------------------------
Affects Version/s: 3.5.0
> Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
> -----------------------------------------------------------------
>
> Key: SPARK-51579
> URL: https://issues.apache.org/jira/browse/SPARK-51579
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0, 4.0.0
> Reporter: Ángel Álvarez Pascua
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When Spark reads a CSV file, it strips the newline character from each line
> before parsing the content in the
> {{org.apache.spark.sql.catalyst.csv.UnivocityParser.parseIterator}} method.
> During parsing, {{UnivocityParser}} expects a delimiter or a newline
> character. With the newlines removed, it reaches the end of its input
> instead and internally throws (and later ignores) an {{EOFException}} for
> each line.
> h4. *Impact:*
> * The repeated generation of {{EOFException}} instances is an expensive
> operation in the JVM.
> * This leads to significant performance degradation during CSV file loading.
> h4. *Expected Behavior:*
> * Spark should handle new line characters appropriately to prevent excessive
> exception generation.
> * Optimizing this behavior would improve overall CSV parsing performance.
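
The per-line pattern described above can be illustrated with a small standalone sketch. It is a toy, not Spark's or Univocity's actual code: the class {{EofPerLineSketch}}, its methods, and the use of {{java.io.EOFException}} as a stand-in are all assumptions made for illustration. It only shows that a tokenizer which waits for a newline raises one exception per record when each line is handed over without its terminator, and none when the terminator is kept.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;

public class EofPerLineSketch {

    // Hypothetical toy tokenizer: consumes one record and returns its field count.
    // It signals "input ended before a newline" with EOFException, as a stand-in
    // for the behavior described in the issue.
    public static int readRecord(Reader in) throws IOException {
        int fields = 1;
        while (true) {
            int c = in.read();
            if (c == -1) {
                throw new EOFException("input ended before a newline");
            }
            if (c == '\n') {
                return fields;
            }
            if (c == ',') {
                fields++;
            }
        }
    }

    // Parses every line on its own and counts the EOFExceptions that were
    // thrown and swallowed along the way.
    public static int countEofExceptions(List<String> lines, boolean keepNewline) {
        int thrown = 0;
        for (String line : lines) {
            String input = keepNewline ? line + "\n" : line;
            try (Reader in = new StringReader(input)) {
                readRecord(in);
            } catch (EOFException e) {
                thrown++; // ignored by the caller, but the exception was still created
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return thrown;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("a,b,c", "1,2,3", "4,5,6");
        System.out.println("newline stripped: " + countEofExceptions(lines, false));
        System.out.println("newline kept:     " + countEofExceptions(lines, true));
    }
}
```

In this toy the number of swallowed exceptions equals the number of lines when the newline is stripped, and drops to zero when it is kept. Whether keeping the terminator is the right fix for Spark itself is not something the sketch can establish.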