[ 
https://issues.apache.org/jira/browse/SPARK-51579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ángel Álvarez Pascua updated SPARK-51579:
-----------------------------------------
    Affects Version/s: 3.5.0

> Spark CSV Read Low Performance: EOFExceptions in Univocity Parser
> -----------------------------------------------------------------
>
>                 Key: SPARK-51579
>                 URL: https://issues.apache.org/jira/browse/SPARK-51579
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Ángel Álvarez Pascua
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When Spark reads a CSV file, it strips the newline character from each line 
> before parsing the content in the 
> {{org.apache.spark.sql.catalyst.csv.UnivocityParser.parseIterator}} method.
> During parsing, {{UnivocityParser}} expects a delimiter or a newline 
> character. However, with the newlines removed, it reaches the end of the 
> input instead and internally throws (and later ignores) an {{EOFException}} 
> for each line.
> h4. *Impact:*
>  * Creating an {{EOFException}} instance for every line is an expensive 
> operation in the JVM.
>  * This significantly degrades performance when loading CSV files.
> h4. *Expected Behavior:*
>  * Spark should handle newline characters so that the parser does not throw 
> an exception for every line.
>  * Optimizing this behavior would improve overall CSV parsing performance.
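
The mechanism described in the issue can be illustrated with a minimal Java sketch. This is not Spark's or univocity's actual code: the class {{EofPerLineSketch}}, the methods {{nextChar}} and {{parseLine}}, and the counter {{eofCount}} are hypothetical names invented for illustration. The sketch assumes a character source that signals end of input by throwing {{EOFException}}, and a tokenizer that ends a record on a newline. When the newline has been stripped, every line ends in an exception instead.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Spark or univocity.
class EofPerLineSketch {
    // Number of EOFExceptions raised (and swallowed) while tokenizing.
    public static int eofCount = 0;

    // Assumed character source: signals end of input by throwing EOFException.
    public static char nextChar(Reader in) throws IOException {
        int c = in.read();
        if (c == -1) {
            throw new EOFException();
        }
        return (char) c;
    }

    // Splits one record into fields. A record normally ends at '\n';
    // if the input runs out first, the EOFException is caught and ignored.
    public static List<String> parseLine(String line, char delimiter) {
        Reader in = new StringReader(line);
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        try {
            while (true) {
                char c = nextChar(in);
                if (c == delimiter) {
                    fields.add(field.toString());
                    field.setLength(0);
                } else if (c == '\n') {
                    fields.add(field.toString());
                    return fields;
                } else {
                    field.append(c);
                }
            }
        } catch (EOFException e) {
            // One exception per line when the newline was stripped beforehand.
            eofCount++;
            fields.add(field.toString());
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] lines = {"a,b,c", "1,2,3"};
        for (String line : lines) {
            parseLine(line, ',');
        }
        System.out.println("EOFExceptions without newlines: " + eofCount);

        eofCount = 0;
        for (String line : lines) {
            parseLine(line + "\n", ',');
        }
        System.out.println("EOFExceptions with newlines: " + eofCount);
    }
}
```

In this sketch both inputs produce the same fields; the only difference is whether an exception object is created along the way. That matches the shape of the problem reported here, although the real code paths in Spark and univocity are more involved.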



--
This message was sent by Atlassian Jira
(v8.20.10#820010)