[
https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812098#comment-17812098
]
Jie Han commented on SPARK-46876:
---------------------------------
The reason is that, before parsing the CSV lines, Spark calls
{{CSVExprUtils.filterCommentAndEmpty}} to filter out "empty" lines, i.e. lines
consisting only of characters <= ' '. I doubt whether it is necessary to do
this, because those characters may be exactly the data itself.
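A quick illustration of why tab-only lines fall into that filter (a sketch,
not the actual Spark source): {{java.lang.String.trim}} strips every character
with code point <= U+0020, and the tab (U+0009) is below that threshold.
{code:scala}
// Sketch of the "empty line" predicate applied before parsing; the real
// logic lives in CSVExprUtils.filterCommentAndEmpty.
def looksEmpty(line: String): Boolean = line.trim.isEmpty

looksEmpty("\t\t") // true  -> a row of empty tab-separated fields is dropped
looksEmpty(";;")   // false -> the equivalent semicolon row survives
{code}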
> Data is silently lost in Tab separated CSV with empty (whitespace) rows
> -----------------------------------------------------------------------
>
> Key: SPARK-46876
> URL: https://issues.apache.org/jira/browse/SPARK-46876
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.1
> Reporter: Martin Rueckl
> Priority: Critical
>
> When reading a tab-separated file that contains lines consisting only of tabs
> (i.e. empty strings as the values of all columns in that row), these rows are
> silently skipped (as empty lines) and the resulting dataframe has fewer rows
> than expected.
> This behavior is inconsistent with the behavior for e.g. semicolon-separated
> files, where the resulting dataframe has a row with only empty string
> values.
> A minimal reproducible example: a file containing
> {code:java}
> a\tb\tc\r\n
> \t\t\r\n
> 1\t2\t3{code}
> will create a dataframe with one row (a=1,b=2,c=3)
> whereas this
> {code:java}
> a;b;c\r\n
> ;;\r\n
> 1;2;3{code}
> will read as two rows (the first row contains only empty strings).
> I used the following PySpark commands to read the dataframes:
> {code:java}
> spark.read.option("header", "true").option("sep", "\t").csv("<tabseparated file>").collect()
> spark.read.option("header", "true").option("sep", ";").csv("<semicolon file>").collect()
> {code}
> I ran into this on Databricks in particular (I assume they use the same
> reader), but [this Stack Overflow
> post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
> indicates that this is an old issue that may have been carried over from
> Databricks when their CSV reader was adopted in SPARK-12420.
> I recommend at least adding a test case for this to the CSV reader.
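> A sketch of such a test, loosely following the conventions of Spark's Scala
> test suites ({{withTempPath}}, {{testImplicits}}, and the test name are
> assumptions, not existing code):
> {code:scala}
> test("SPARK-46876: rows containing only delimiters are not dropped") {
>   import testImplicits._
>   withTempPath { path =>
>     // Write a tab-separated file whose second line consists only of tabs.
>     Seq("a\tb\tc", "\t\t", "1\t2\t3").toDF("value")
>       .coalesce(1).write.text(path.getAbsolutePath)
>     val df = spark.read
>       .option("header", "true")
>       .option("sep", "\t")
>       .csv(path.getAbsolutePath)
>     // Expected: two rows, the first consisting of empty strings only.
>     assert(df.count() === 2)
>   }
> }
> {code}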
>
> Why is this behaviour a problem:
> * It violates some core assumptions:
> ** a properly configured round trip via CSV write/read should result in the
> same set of rows
> ** changing the CSV separator (when everything is properly escaped) should
> have no effect
> Potential resolutions:
> * When the configured delimiter consists only of whitespace:
> ** deactivate the "skip empty lines" feature,
> ** or skip only lines that are completely empty (only a (carriage return)
> newline)
> * Change the "skip empty lines" feature to only ever skip lines that are
> completely empty (i.e. contain nothing but a newline); see the sketch below
> ** this may break some user code that relies on the current behaviour
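> A minimal sketch of what such a predicate could look like, combining both
> ideas (the function and its parameters are assumptions, not the actual Spark
> source):
> {code:scala}
> // Sketch: only drop lines that are truly empty; when the delimiter itself
> // is whitespace, trim-based emptiness checks would swallow real data rows.
> def shouldSkip(line: String, delimiter: String): Boolean = {
>   val delimiterIsWhitespace = delimiter.forall(_ <= ' ')
>   if (delimiterIsWhitespace) line.isEmpty // keep "\t\t" data rows
>   else line.trim.isEmpty                  // current behaviour elsewhere
> }
> {code}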