[GitHub] [spark] HyukjinKwon opened a new pull request #27561: [SPARK-30810][SQL] Parses and convert a CSV Dataset having different column from 'value' in csv(dataset) API

GitBox Thu, 13 Feb 2020 03:30:11 -0800

HyukjinKwon opened a new pull request #27561: [SPARK-30810][SQL] Parses and 
convert a CSV Dataset having different column from 'value' in csv(dataset) API
URL: https://github.com/apache/spark/pull/27561
 
 
   ### What changes were proposed in this pull request?
   
   This PR fixes the `DataFrameReader.csv(dataset: Dataset[String])` API so that 
it accepts a `Dataset[String]` whose column is named something other than `value`.
   
   `CSVUtils.filterCommentAndEmpty` assumed that the single column of the 
`Dataset[String]` is named `value`. This PR changes it to use the first column 
name in the schema instead.
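   
   The idea can be sketched as follows. This is a minimal illustration for 
`spark-shell`, not the actual patch: `dropBlankLines` is a hypothetical helper, 
and it covers only the blank-line filtering, not comment handling.
   
   ```scala
   import org.apache.spark.sql.Dataset
   import org.apache.spark.sql.functions.{col, length, trim}

   // Hypothetical helper, not the real CSVUtils code: drops blank lines from a
   // single-column Dataset[String], whatever that column happens to be called.
   def dropBlankLines(lines: Dataset[String]): Dataset[String] = {
     // Look the column name up in the schema instead of hard-coding "value".
     val firstColumn = lines.schema.fieldNames.head
     // Backticks keep names containing dots or spaces from being misparsed.
     lines.filter(length(trim(col(s"`$firstColumn`"))) > 0)
   }
   ```
   
   Calling `dropBlankLines(ds)` on a `Dataset[String]` whose column is named 
`text` then resolves `text` rather than failing to resolve `value`.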
   
   ### Why are the changes needed?
   
   So that `DataFrameReader.csv(dataset: Dataset[String])` supports any 
`Dataset[String]`, as its signature indicates.
   
   ### Does this PR introduce any user-facing change?
   Yes. The following snippet now works instead of failing:
   
   ```scala
   // Needed for .as[String]; spark-shell imports it automatically.
   import spark.implicits._
   val ds = spark.range(2).selectExpr("concat('a,b,', id) AS text").as[String]
   spark.read.option("header", true).option("inferSchema", true).csv(ds).show()
   ```
   
   Before:
   
   ```
   org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input columns: [text];;
   'Filter (length(trim('value, None)) > 0)
   +- Project [concat(a,b,, cast(id#0L as string)) AS text#2]
      +- Range (0, 2, step=1, splits=Some(2))
   ```
   
   After:
   
   ```
   +---+---+---+
   |  a|  b|  0|
   +---+---+---+
   |  a|  b|  1|
   +---+---+---+
   ```
   
   
   ### How was this patch tested?
   
   A unit test was added.
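   
   The test itself is not shown here. A check along the following lines would 
exercise the fixed behavior; it is only a sketch, assuming a throwaway local 
`SparkSession` (the `appName` is arbitrary) and a Spark build that includes this 
change.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   // Assumption: a local session created just for this check.
   val spark = SparkSession.builder()
     .master("local[1]")
     .appName("csv-dataset-check")
     .getOrCreate()
   import spark.implicits._

   // Same input as the example above: a Dataset[String] whose column is "text".
   val ds = spark.range(2).selectExpr("concat('a,b,', id) AS text").as[String]
   val df = spark.read.option("header", true).option("inferSchema", true).csv(ds)

   // The first line ("a,b,0") becomes the header, leaving a single data row.
   assert(df.columns.toSeq == Seq("a", "b", "0"))
   assert(df.count() == 1)
   ```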

