HyukjinKwon opened a new pull request, #36294:
URL: https://github.com/apache/spark/pull/36294

   ### What changes were proposed in this pull request?
   
   This PR proposes to disable the `lineSep` option in the `from_csv` and 
`schema_of_csv` expressions by setting the line separator to a noncharacter, 
`U+FFFF` (`\uFFFF`), as defined in the [Unicode 
specification](https://www.unicode.org/charts/PDF/UFFF0.pdf). The specification 
reserves noncharacters for internal use within a program.
   
   The Univocity parser does not allow omitting the line separator (as far as I 
can tell from reading its code), so this approach was chosen.
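
   As a rough illustration of the idea (a toy model in plain Scala, not Spark or 
Univocity code; `RecordSplitSketch` and `splitRecords` are hypothetical names), 
splitting on a separator that never occurs in real text leaves the whole input 
as a single record:

   ```scala
   import java.util.regex.Pattern

   // Toy model only: mimics "split the input into records on lineSep".
   object RecordSplitSketch {
     // Split on the literal separator, keeping trailing empty records.
     def splitRecords(input: String, lineSep: String): Seq[String] =
       input.split(Pattern.quote(lineSep), -1).toSeq

     val input = "1,\n2,3,4,5"

     // With "\n" as the separator, the input breaks into two records.
     val byNewline = splitRecords(input, "\n")

     // With the noncharacter U+FFFF, the input stays one record.
     val byNoncharacter = splitRecords(input, "\uFFFF")
   }
   ```

   Here `byNewline` holds two records while `byNoncharacter` holds exactly one, 
which is the behavior the expressions need.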
   
   This specific code path is not affected by our `encoding` or `charset` 
option because the Univocity parser handles the input as Unicode strings 
internally.
   
   ### Why are the changes needed?
   
   Currently, this option takes effect in an unexpected way. See the `from_csv` 
example below:
   
   ```scala
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.functions._
   
   // Outside of spark-shell, `toDF` also needs `import spark.implicits._`.
   Seq[String]("1,\n2,3,4,5").toDF.select(
     col("value"),
     from_csv(
       col("value"),
       StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
       Map[String, String]())).show()
   ```
   
   ```
   +-----------+---------------+
   |      value|from_csv(value)|
   +-----------+---------------+
   |1,\n2,3,4,5|      {1, null}|
   +-----------+---------------+
   ```
   
   The result `{1, null}` should be `{1, \n2}`.
   
   The CSV expressions cannot easily support this option because it is a 
plan-level option that can change the number of returned rows, whereas the 
expressions are designed to emit exactly one row. The option works naturally 
only in the scan plan of the CSV data source. Therefore, we should disable this 
option in the expressions.
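
   The mismatch can be sketched with a toy model (plain Scala, not Spark 
internals; `LineSepToy`, `records`, `toRow`, `scan`, and `expr` are hypothetical 
names, and the two-column schema mirrors the example above). A scan may return 
one row per record, but an expression has to return exactly one row, so any 
record after the first is lost:

   ```scala
   import java.util.regex.Pattern

   // Toy model only: these helpers are illustrative and are not Spark APIs.
   object LineSepToy {
     type Row = (Option[String], Option[String])

     // Split raw text into records on a literal separator.
     def records(input: String, lineSep: String): Seq[String] =
       input.split(Pattern.quote(lineSep), -1).toSeq

     // Map one record onto a two-column schema (a, b).
     // Missing or empty fields become None; extra fields are ignored.
     def toRow(record: String): Row = {
       val fields = record.split(",", -1)
       def field(i: Int): Option[String] =
         if (i < fields.length && fields(i).nonEmpty) Some(fields(i)) else None
       (field(0), field(1))
     }

     // A scan can emit one row per record.
     def scan(input: String, lineSep: String): Seq[Row] =
       records(input, lineSep).map(toRow)

     // An expression must emit exactly one row: only the first record is used.
     def expr(input: String, lineSep: String): Row =
       toRow(records(input, lineSep).head)
   }
   ```

   In this model, `LineSepToy.expr("1,\n2,3,4,5", "\n")` yields 
`(Some("1"), None)`, matching the `{1, null}` shown above, while using 
`"\uFFFF"` as the separator yields `(Some("1"), Some("\n2"))`.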
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, line separator characters can now appear in the output of `from_csv` and 
`schema_of_csv`.
   
   ### How was this patch tested?
   
   Manually tested, and a unit test was added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

