Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21296#discussion_r187426203
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
---
@@ -267,7 +267,7 @@ class CSVSuite extends QueryTest with SharedSQLContext
with SQLTestUtils with Te
.options(Map("header" -> "true", "mode" -> "dropmalformed"))
.load(testFile(carsFile))
- assert(cars.select("year").collect().size === 2)
+ assert(cars.collect().size === 2)
--- End diff --
The `cars.csv` file has a header with 5 columns:
```
year,make,model,comment,blank
```
and 2 rows with 4 valid values each, where the last (`blank`) column is empty:
```
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
```
and one more row with only 3 columns:
```
2015,Chevy,Volt
```
The previous (current) implementation drops the last row in `dropmalformed`
mode because it parses whole rows, and the last one is malformed. If only the
`year` column is selected, the uniVocity parser returns values only for the first
column (index `0`) and doesn't check the correctness of the rest of the row.
As a result, `cars.select("year").collect().size` returns `3`.
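The difference can be sketched as below. This is a hypothetical reproduction, not code from the PR; it assumes a running `SparkSession` named `spark`, a local path to the `cars.csv` file shown above, and ScalaTest's `===` matcher:

```scala
// Read cars.csv with dropmalformed mode, as in the test under discussion.
// "path/to/cars.csv" is a placeholder for the actual test resource path.
val cars = spark.read
  .format("csv")
  .options(Map("header" -> "true", "mode" -> "dropmalformed"))
  .load("path/to/cars.csv")

// Without column pruning, whole rows are parsed: the 3-column row
// (2015,Chevy,Volt) is detected as malformed and dropped.
assert(cars.collect().size === 2)

// With column pruning, uniVocity only parses the requested column
// (index 0), so the 3-column row is never flagged as malformed.
// Before the behavior change in this PR, this returned 3, not 2:
// cars.select("year").collect().size
```

This is why the assertion in the diff switches from `cars.select("year").collect().size` to `cars.collect().size`: only the latter still exercises full-row malformed-record detection.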
---