thadeusb commented on a change in pull request #23080: [SPARK-26108][SQL]
Support custom lineSep in CSV datasource
URL: https://github.com/apache/spark/pull/23080#discussion_r272700796
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -192,6 +192,20 @@ class CSVOptions(
*/
val emptyValueInWrite = emptyValue.getOrElse("\"\"")
+ /**
+ * A string between two consecutive CSV records.
+ */
+ val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
+ require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+ require(sep.length == 1, "'lineSep' can contain only 1 character.")
Review comment:
I am setting multiLine = "true".
The problem I am having is that the last column name in the CSV header gets a
trailing \r appended to it.
So if I have
name,age,text\r\nfred,30,"likes\r\npie,cookies,milk"\njill,30,"likes\ncake,cookies,milk"\r\n
I was getting a schema with:
StringType("NAME")
IntegerType("AGE")
StringType("TEXT\r")
Could it be the mixed use of \r\n and \n, so that the parser only treats \n as
the newline?
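For illustration, the trailing \r can be reproduced outside Spark: when a parser splits records only on \n while the data terminates records with \r\n, the \r stays attached to the last field of every record. A minimal Python sketch (simplified sample data, not the actual Spark/univocity parser):

```python
# Sample CSV text using mixed \r\n and \n record terminators,
# mirroring the example in this comment (quoted newlines omitted
# so a naive split stays valid).
data = 'name,age,text\r\nfred,30,"likes pie"\njill,30,"likes cake"\r\n'

# A parser that treats only "\n" as the record separator keeps the
# "\r" attached to the last field of every \r\n-terminated record.
records = [line.split(",") for line in data.split("\n") if line]
header = records[0]

print(header)  # the last header field carries the leftover \r
```

This is consistent with the observed schema: the header line ends in \r\n, so splitting on \n alone yields a last column named "text\r".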
Another issue: the lineSep value is controlled upstream by a separate
configuration provided by users who have no knowledge of Spark but know how
they formatted their CSV files. Without some re-architecture, it is not
possible to detect that this setting is \r\n and then reset it to None for
CSVOptions.
lineSeparator.foreach(format.setLineSeparator) already handles 1 to 2
characters, so I figured supporting that for the lineSep configuration is
safe, no?
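If two-character separators were accepted, the single-character check in the diff could simply be relaxed. A hedged Python sketch of that relaxed validation (the function name is hypothetical; the real check lives in the Scala CSVOptions shown above):

```python
def validate_line_sep(sep: str) -> str:
    """Hypothetical analogue of the 'lineSep' checks in CSVOptions,
    relaxed to accept one- or two-character separators such as \\r\\n."""
    if not sep:
        raise ValueError("'lineSep' cannot be an empty string.")
    if len(sep) > 2:
        raise ValueError("'lineSep' can contain only 1 or 2 characters.")
    return sep

validate_line_sep("\n")    # single character, accepted today
validate_line_sep("\r\n")  # two characters, accepted under the relaxed check
```

This mirrors the constraint already enforced downstream, since the underlying writer setting accepts separators of one or two characters.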