[jira] [Commented] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data

Robert Joseph Evans (Jira) Wed, 20 Apr 2022 14:40:05 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525310#comment-17525310
 ]


Robert Joseph Evans commented on SPARK-38955:
---------------------------------------------

Conceptually I am fine if we want to remove all line separators in from_csv. 
That is what I would expect to happen, and that is what happens with from_json.

{code}
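// spark-shell already imports spark.implicits._ (needed for toDF).
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Every non-null input has a raw newline embedded in the JSON text.
val schema = StructType(Seq(StructField("a", LongType), StructField("b", StringType)))
Seq[String]("{'a': 1\n}", "{'a': \n3, 'b': 'test\n3'}", "{'a'\n: 4}", null).toDF.select(
  col("value"),
  from_json(col("value"), schema,
    Map[String, String]("allowUnquotedControlChars" -> "true"))
).show(truncate = false)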
+--------------------------+----------------+
|value                     |from_json(value)|
+--------------------------+----------------+
|{'a': 1\n}                |{1, null}       |
|{'a': \n3, 'b': 'test\n3'}|{3, test\n3}    |
|{'a'\n: 4}                |{4, null}       |
|null                      |null            |
+--------------------------+----------------+
{code}


But there is no way to turn off line separators in the CSV parser.

https://github.com/uniVocity/univocity-parsers/blob/7e7d1b3c0a3dceaed4a8413875eb1500f2a028ec/src/main/java/com/univocity/parsers/common/Format.java#L54-L65

So implementing the proposed fix may be difficult. Replacing the default line 
separator '\n' with another character such as '\0' might be okay, but I know of 
people with '\0' in their data, so it would not truly fix the problem.

An alternative might be to clear the state of the CSV parser after each row of 
input, i.e. read all of the remaining tokens out of the parser after each row. 
The '\n' is still parsed, so the output for a single row is still not ideal if 
it contains the line separator, but at least it does not corrupt the output of 
a good row after it.
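
To make the "drain after each row" idea concrete, here is a minimal toy sketch. 
It does not use the univocity API: ToyLineParser, feed, nextRecord, parseLeaky 
and parseDrained are all hypothetical names, and the toy parser only models a 
tokenizer that keeps unread records buffered between calls. It is not expected 
to reproduce the exact corrupted output shown in the description, only the 
general way leftover state can leak into the next row.

{code}
import scala.collection.mutable

object DrainSketch {
  // Hypothetical toy parser, NOT univocity: it buffers every record it finds
  // in the input and hands them out one at a time.
  final class ToyLineParser(lineSep: Char = '\n') {
    private val pending = mutable.Queue.empty[Array[String]]

    // Split the input on lineSep and buffer each resulting record.
    def feed(input: String): Unit =
      input.split(lineSep).foreach(rec => pending.enqueue(rec.split(",", -1)))

    // Return the next buffered record, if there is one.
    def nextRecord(): Option[Array[String]] =
      if (pending.isEmpty) None else Some(pending.dequeue())
  }

  // One result per input row, never clearing state: leftovers leak forward.
  def parseLeaky(rows: Seq[String]): Seq[Option[Seq[String]]] = {
    val parser = new ToyLineParser()
    rows.map { row =>
      parser.feed(row)
      parser.nextRecord().map(_.toSeq)
    }
  }

  // Same, but read all remaining records out of the parser after each row.
  def parseDrained(rows: Seq[String]): Seq[Option[Seq[String]]] = {
    val parser = new ToyLineParser()
    rows.map { row =>
      parser.feed(row)
      val first = parser.nextRecord().map(_.toSeq)
      while (parser.nextRecord().isDefined) () // discard this row's leftovers
      first
    }
  }
}
{code}

With the rows "1,\n2,3,4,5" and "6,7,8,9,10", parseLeaky returns the leftover 
2,3,4,5 for the second row, while parseDrained returns 6,7,8,9,10. The first 
row still yields only its first record in both cases, which is the "not ideal" 
part mentioned above.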

> from_csv can corrupt surrounding lines if a lineSep is in the data
> ------------------------------------------------------------------
>
>                 Key: SPARK-38955
>                 URL: https://issues.apache.org/jira/browse/SPARK-38955
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>
> I don't know how critical this is. I was doing some general testing to 
> understand {{from_csv}} and found that if I happen to have a {{lineSep}} in 
> the input data, the next row appears to be corrupted. 
> {{multiLine}} does not appear to fix it. Because this is data corruption I am 
> inclined to mark this as CRITICAL or BLOCKER, but it is an odd corner case so 
> I am not going to set it myself.
> {code}
> Seq[String]("1,\n2,3,4,5","6,7,8,9,10", "11,12,13,14,15", 
> null).toDF.select(col("value"), from_csv(col("value"), 
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))), 
> Map[String,String]())).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,:2,3,4,5","6,7,8,9,10", "11,12,13,14,15", 
> null).toDF.select(col("value"), from_csv(col("value"), 
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))), 
> Map[String,String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |    1,:2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,\n2,3,4,5","6,7,8,9,10", "11,12,13,14,15", 
> null).toDF.select(col("value"), from_csv(col("value"), 
> StructType(Seq(StructField("a", LongType), StructField("b", StringType))), 
> Map[String,String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|       {1, \n2}|
> |    6,7,8,9,10|         {6, 7}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}


