[
https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812098#comment-17812098
]
Jie Han commented on SPARK-46876:
---------------------------------
The reason is that, before parsing the CSV lines, Spark calls
{{CSVExprUtils.filterCommentAndEmpty}} to filter out "empty" lines, i.e. lines
consisting only of characters <= ' '. I doubt whether it is necessary to do
this, because those characters may be exactly the data itself.
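A quick illustration of why tab-only lines fall into that filter (a sketch,
not the actual Spark source): {{java.lang.String.trim}} strips every character
with code point <= U+0020, and the tab (U+0009) is below that threshold.
{code:scala}
// Sketch of the "empty line" predicate applied before parsing; the real
// logic lives in CSVExprUtils.filterCommentAndEmpty.
def looksEmpty(line: String): Boolean = line.trim.isEmpty

looksEmpty("\t\t") // true  -> a row of empty tab-separated fields is dropped
looksEmpty(";;")   // false -> the equivalent semicolon row survives
{code}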
> Data is silently lost in Tab separated CSV with empty (whitespace) rows
> -----------------------------------------------------------------------
>
> Key: SPARK-46876
> URL: https://issues.apache.org/jira/browse/SPARK-46876
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.4.1
> Reporter: Martin Rueckl
> Priority: Critical
>
> When reading a tab-separated file that contains lines consisting only of tabs
> (i.e. empty strings as the values of all columns in that row), these rows are
> silently skipped (as empty lines) and the resulting dataframe has fewer rows
> than expected.
> This behavior is inconsistent with the behavior for e.g. semicolon-separated
> files, where the resulting dataframe has a row with only empty string
> values.
> A minimal reproducible example: a file containing
> {code:java}
> a\tb\tc\r\n
> \t\t\r\n
> 1\t2\t3{code}
> will create a dataframe with one row (a=1,b=2,c=3)
> whereas this
> {code:java}
> a;b;c\r\n
> ;;\r\n
> 1;2;3{code}
> will read as two rows (the first row contains only empty strings).
> I used the following PySpark commands to read the dataframes:
> {code:java}
> spark.read.option("header", "true").option("sep", "\t").csv("<tabseparated file>").collect()
> spark.read.option("header", "true").option("sep", ";").csv("<semicolon file>").collect()
> {code}
> I ran into this on Databricks in particular (I assume they use the same
> reader), but [this Stack Overflow
> post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
> indicates that this is an old issue that may have been carried over from
> Databricks when their CSV reader was adopted in SPARK-12420.
> I recommend at least adding a test case for this to the CSV reader.
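> A sketch of such a test, loosely following the conventions of Spark's Scala
> test suites ({{withTempPath}}, {{testImplicits}}, and the test name are
> assumptions, not existing code):
> {code:scala}
> test("SPARK-46876: rows containing only delimiters are not dropped") {
>   import testImplicits._
>   withTempPath { path =>
>     // Write a tab-separated file whose second line consists only of tabs.
>     Seq("a\tb\tc", "\t\t", "1\t2\t3").toDF("value")
>       .coalesce(1).write.text(path.getAbsolutePath)
>     val df = spark.read
>       .option("header", "true")
>       .option("sep", "\t")
>       .csv(path.getAbsolutePath)
>     // Expected: two rows, the first consisting of empty strings only.
>     assert(df.count() === 2)
>   }
> }
> {code}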
>
> Why is this behaviour a problem:
> * It violates some core assumptions:
> ** a properly configured round trip via CSV write/read should result in the
> same set of rows
> ** changing the CSV separator (when everything is properly escaped) should
> have no effect
> Potential resolutions:
> * When the configured delimiter consists only of whitespace:
> ** deactivate the "skip empty lines" feature,
> ** or skip only lines that are completely empty (only a (carriage return)
> newline)
> * Change the "skip empty lines" feature to only ever skip lines that are
> completely empty (i.e. contain nothing but a newline); see the sketch below
> ** this may break some user code that relies on the current behaviour
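> A minimal sketch of what such a predicate could look like, combining both
> ideas (the function and its parameters are assumptions, not the actual Spark
> source):
> {code:scala}
> // Sketch: only drop lines that are truly empty; when the delimiter itself
> // is whitespace, trim-based emptiness checks would swallow real data rows.
> def shouldSkip(line: String, delimiter: String): Boolean = {
>   val delimiterIsWhitespace = delimiter.forall(_ <= ' ')
>   if (delimiterIsWhitespace) line.isEmpty // keep "\t\t" data rows
>   else line.trim.isEmpty                  // current behaviour elsewhere
> }
> {code}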