[jira] [Updated] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows

Martin Rueckl (Jira) Fri, 26 Jan 2024 02:37:04 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Martin Rueckl updated SPARK-46876:
----------------------------------
    Description: 
When reading a tab-separated file that contains lines consisting only of tabs 
(i.e. every column value in that row is an empty string), these rows are 
silently skipped (treated as empty lines) and the resulting dataframe has 
fewer rows than expected.

This behavior is inconsistent with the behavior for e.g. semicolon-separated 
files, where the resulting dataframe contains a row with only empty string 
values.
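
One plausible explanation, stated here as an assumption and not checked against the Spark source in this report: if the reader decides whether a line is empty only after trimming whitespace, a line of only tabs counts as empty while a line of only semicolons does not. A plain Python illustration of that difference (the function name is hypothetical and is not a Spark API):

```python
def looks_empty_after_trim(line: str) -> bool:
    """Return True if nothing is left once surrounding whitespace is removed.

    Assumption: this mimics a trim-based "skip empty lines" check.
    Tabs, carriage returns and newlines all count as whitespace here.
    """
    return line.strip() == ""


tab_only_row = "\t\t\r\n"       # three empty columns, tab-separated
semicolon_only_row = ";;\r\n"   # three empty columns, semicolon-separated

print(looks_empty_after_trim(tab_only_row))        # tab-only row is treated as empty
print(looks_empty_after_trim(semicolon_only_row))  # semicolon row is kept
```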

A minimal reproducible example: a file containing this
{code:java}
a\tb\tc\r\n
\t\t\r\n
1\t2\t3{code}
will create a dataframe with one row (a=1,b=2,c=3)
whereas this
{code:java}
a;b;c\r\n
;;\r\n
1;2;3{code}
will read as two rows (first row contains empty strings)

I used the following PySpark commands to read the dataframes:
{code:python}
spark.read.option("header", "true").option("sep", "\t").csv("<tabseparated file>").collect()
spark.read.option("header", "true").option("sep", ";").csv("<semicolon file>").collect()
{code}
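
To make the reproduction byte-exact (CRLF line endings, no trailing newline after the last row), the two input files can be generated with a short script. This is only a sketch: the helper name and the file names are hypothetical, and the resulting paths would replace the placeholders in the read commands above.

```python
import os
import tempfile


def write_sample_files(directory):
    """Write the tab- and semicolon-separated samples from this report.

    Returns a dict mapping each separator to the path of its file.
    File names are placeholders chosen for this sketch.
    """
    rows = [("a", "b", "c"), ("", "", ""), ("1", "2", "3")]
    paths = {}
    for name, sep in (("sample_tab.csv", "\t"), ("sample_semicolon.csv", ";")):
        path = os.path.join(directory, name)
        # newline="" stops Python from translating the explicit "\r\n".
        with open(path, "w", newline="") as f:
            f.write("\r\n".join(sep.join(row) for row in rows))
        paths[sep] = path
    return paths


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    for sep, path in write_sample_files(out_dir).items():
        print(repr(sep), path)
```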
I ran into this on Databricks in particular (I assume they use the same reader), 
but [this Stack Overflow post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
 indicates that this is an old issue that may have been carried over from 
Databricks when their CSV reader was adopted in SPARK-12420.

I recommend at least adding a corresponding test case to the CSV reader.

 

Why this behaviour is a problem:
 * It violates some core assumptions:
 ** a properly configured roundtrip via CSV write/read should result in the 
same set of rows
 ** changing the CSV separator (when everything is properly escaped) should 
have no effect
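
The two assumptions above can be written down as a small check. The sketch below uses Python's standard csv module, not Spark, purely to show the invariant being asked for: the same rows come back after a write/read roundtrip, whichever separator is used. The function name is hypothetical.

```python
import csv
import io


def roundtrip(rows, sep):
    """Write rows as CSV with the given separator, then read them back."""
    buf = io.StringIO(newline="")  # keep "\r\n" line endings untouched
    csv.writer(buf, delimiter=sep).writerows(rows)
    buf.seek(0)
    return [row for row in csv.reader(buf, delimiter=sep)]


rows = [["a", "b", "c"], ["", "", ""], ["1", "2", "3"]]

# The row of empty strings survives with both separators.
print(roundtrip(rows, "\t") == rows)
print(roundtrip(rows, ";") == rows)
```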

Potential resolutions:
 * When the configured delimiter consists only of whitespace
 ** deactivate the "skip empty lines" feature
 ** or skip only lines that are completely empty (only a (carriage return) 
newline)
 * Change the "skip empty lines" feature to skip a line only if it is 
completely empty (contains only a newline)
 ** this may break some user code that relies on the current behaviour
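
The first resolution could look roughly like the predicate below. This is a sketch under stated assumptions: the function name is hypothetical, it is plain Python and not Spark code, and the trim-based fallback branch is only assumed to match the current behaviour.

```python
def should_skip_line(line: str, sep: str) -> bool:
    """Decide whether a raw input line should be skipped as empty.

    Hypothetical predicate for the first proposed resolution.
    """
    content = line.rstrip("\r\n")  # drop only the line terminator
    if sep.strip() == "":
        # Whitespace-only delimiter: skip only completely empty lines,
        # so a row such as "\t\t" is kept as a row of empty strings.
        return content == ""
    # Other delimiters: keep the trim-based check (assumed current behaviour).
    return content.strip() == ""


print(should_skip_line("\t\t\r\n", "\t"))  # tab-only row is kept
print(should_skip_line("\r\n", "\t"))      # completely empty line is skipped
```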


> Data is silently lost in Tab separated CSV with empty (whitespace) rows
> -----------------------------------------------------------------------
>
>                 Key: SPARK-46876
>                 URL: https://issues.apache.org/jira/browse/SPARK-46876
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.4.1
>            Reporter: Martin Rueckl
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

