HyukjinKwon opened a new pull request #27561: [SPARK-30810][SQL] Parses and 
convert a CSV Dataset having different column from 'value' in csv(dataset) API
URL: https://github.com/apache/spark/pull/27561
 
 
   ### What changes were proposed in this pull request?
   
   This PR fixes the `DataFrameReader.csv(dataset: Dataset[String])` API so that 
it accepts a `Dataset[String]` whose column is named something other than `value`.
   
   `CSVUtils.filterCommentAndEmpty` assumed that the single column of the 
`Dataset[String]` is named `value`. This PR changes it to use the first column 
name in the schema instead.
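   
   As an illustration only, the idea of filtering on the first column name in 
the schema could be sketched as below. `FirstColumnFilterSketch` and 
`dropEmptyLines` are hypothetical names, not Spark internals. Comment-line 
filtering is omitted, and this is not the actual patch.
   
   ```scala
   import org.apache.spark.sql.{Dataset, SparkSession}
   import org.apache.spark.sql.functions.{col, length, trim}

   // Hypothetical sketch; names here are assumptions made for illustration.
   object FirstColumnFilterSketch {

     // Drops blank lines by filtering on whatever the first column is called,
     // rather than on a hard-coded `value` column.
     def dropEmptyLines(lines: Dataset[String]): Dataset[String] = {
       val firstColumn = lines.schema.fieldNames.head
       lines.filter(length(trim(col(firstColumn))) > 0)
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("first-column-filter-sketch")
         .getOrCreate()
       import spark.implicits._

       // The column is named `text`, not `value`; the middle line is blank.
       val ds = Seq("a,b,0", "   ", "a,b,1").toDF("text").as[String]
       dropEmptyLines(ds).collect().foreach(println)

       spark.stop()
     }
   }
   ```
   
   Looking the name up from the schema keeps the filter independent of how the 
caller produced the `Dataset[String]`.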
   
   ### Why are the changes needed?
   
   `DataFrameReader.csv(dataset: Dataset[String])` should accept any 
`Dataset[String]`, as its signature indicates.
   
   ### Does this PR introduce any user-facing change?
   Yes. The following example, which previously failed, now works:
   
   ```scala
   val ds = spark.range(2).selectExpr("concat('a,b,', id) AS text").as[String]
   spark.read.option("header", true).option("inferSchema", true).csv(ds).show()
   ```
   
   Before:
   
   ```
   org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input 
columns: [text];;
   'Filter (length(trim('value, None)) > 0)
   +- Project [concat(a,b,, cast(id#0L as string)) AS text#2]
      +- Range (0, 2, step=1, splits=Some(2))
   ```
   
   After:
   
   ```
   +---+---+---+
   |  a|  b|  0|
   +---+---+---+
   |  a|  b|  1|
   +---+---+---+
   ```
   
   
   ### How was this patch tested?
   
   A unit test was added.
   

