GitHub user MaxGekk opened a pull request:

    https://github.com/apache/spark/pull/22656

    [SPARK-25669][SQL] Check CSV header only when it exists

    ## What changes were proposed in this pull request?
    
    Currently, the first row of a dataset of CSV strings is compared to the
    field names of the user-specified or inferred schema regardless of whether
    a CSV header is present. This causes false-positive error messages. For
    example, parsing `"1,2"` outputs the error:
    
    ```java
    java.lang.IllegalArgumentException: CSV header does not conform to the schema.
     Header: 1, 2
     Schema: _c0, _c1
    Expected: _c0 but found: 1
    ```
    
    In this PR, I propose:
    - Checking the CSV header only when it exists
    - Filtering the header from the input dataset only if it exists
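
The two points above can be illustrated with a minimal, standalone sketch. It is not Spark's implementation: `check_header`, `read_rows`, and the `header` flag are hypothetical names chosen for illustration, and the naive `split(",")` stands in for a real CSV parser. The sketch only shows the intended control flow, in which the first row is validated and dropped only when a header is declared.

```python
from typing import List, Sequence


def check_header(first_row: Sequence[str], schema_fields: Sequence[str]) -> None:
    """Raise ValueError if the header names differ from the schema field names."""
    if len(first_row) != len(schema_fields):
        raise ValueError(
            f"Header has {len(first_row)} columns, schema has {len(schema_fields)}"
        )
    for actual, expected in zip(first_row, schema_fields):
        if actual != expected:
            raise ValueError(
                "CSV header does not conform to the schema.\n"
                f" Header: {', '.join(first_row)}\n"
                f" Schema: {', '.join(schema_fields)}\n"
                f"Expected: {expected} but found: {actual}"
            )


def read_rows(
    lines: Sequence[str], schema_fields: Sequence[str], header: bool
) -> List[List[str]]:
    """Split CSV lines into rows (hypothetical helper, naive comma splitting).

    The first row is checked against the schema and removed only when
    `header` is True; otherwise every line is treated as data.
    """
    rows = [line.split(",") for line in lines]
    if header and rows:
        check_header(rows[0], schema_fields)
        rows = rows[1:]
    return rows


# No header declared: "1,2" is data, so no header check is performed.
data_only = read_rows(["1,2"], ["_c0", "_c1"], header=False)

# Header declared: the first row is validated and then filtered out.
with_header = read_rows(["a,b", "1,2"], ["a", "b"], header=True)
```

With `header=False`, the input `"1,2"` is returned as a data row instead of triggering the error shown above.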
    
    ## How was this patch tested?
    
    Added a test to `CSVSuite` which reproduces the issue.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MaxGekk/spark-1 inferred-header-check

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22656.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22656
    
----
commit 676e5580e5f01c7800c734634911b65a91531d4b
Author: Maxim Gekk <maxim.gekk@...>
Date:   2018-10-06T12:33:28Z

    Don't need to check inferred field names to the first row

----


