[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers

MaxGekk Sun, 25 Mar 2018 02:45:08 -0700

Github user MaxGekk commented on the issue:

    https://github.com/apache/spark/pull/20894
  
    @HyukjinKwon 
    > I think we are fine to just document this like saying them to better use select or renaming it after the load
    
    The problem occurs during loading. Could you please explain how select or renaming the columns would solve the issue I described above? Spark loads the data silently, and some partitions end up with wrong data.
    
    Please try this experiment:
    
    _Create two files 1.csv and 2.csv in the same folder_
    ```
    $ cat 1.csv
    temperature, depth
    10.0, 5.0
    ```
    ```
    $ cat 2.csv
    depth, temperature
    1234, 4.1
    ```
    _Read the files with Spark:_
    ```
    val data = spark.read.option("header", "true").csv("folder/*.csv")
    data.select("temperature").show
    ```
    As a user, I would expect either:
    ```
    +-----------+
    |temperature|
    +-----------+
    |       10.0|
    |        4.1|
    +-----------+
    ```
    or an error, but not this output:
    ```
    +-----------+
    |temperature|
    +-----------+
    |       10.0|
    |       1234|
    +-----------+
    ```
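    One way to catch this before loading is to compare the header lines of the files. The following is only a minimal sketch in plain Scala, not the implementation in this PR: `parseHeader`, `mismatchedHeaders` and `firstLine` are hypothetical helpers, and the naive comma split assumes no quoted fields.

    ```scala
    import java.io.File
    import scala.io.Source

    // Hypothetical helper: split a header line into trimmed column names.
    // Assumes a plain comma delimiter and no quoted fields.
    def parseHeader(line: String): List[String] =
      line.split(",").map(_.trim).toList

    // Hypothetical helper: given (fileName, headerLine) pairs, return the names
    // of files whose column order differs from that of the first file.
    def mismatchedHeaders(headers: Seq[(String, String)]): Seq[String] =
      headers.headOption match {
        case None => Seq.empty
        case Some((_, firstHeader)) =>
          val expected = parseHeader(firstHeader)
          headers.tail.collect {
            case (name, line) if parseHeader(line) != expected => name
          }
      }

    // Read only the first line of a file; empty string for an empty file.
    def firstLine(file: File): String = {
      val src = Source.fromFile(file)
      try {
        val lines = src.getLines()
        if (lines.hasNext) lines.next() else ""
      } finally src.close()
    }
    ```

    For the two files above, passing `Seq("1.csv" -> firstLine(new File("folder/1.csv")), "2.csv" -> firstLine(new File("folder/2.csv")))` to `mismatchedHeaders` would flag `2.csv`, so the load could fail fast instead of silently mixing columns.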



