Rahul Tanwani created SPARK-13309:
-------------------------------------

             Summary: Incorrect type inference for CSV data.
                 Key: SPARK-13309
                 URL: https://issues.apache.org/jira/browse/SPARK-13309
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Rahul Tanwani
             Fix For: 1.6.0


Type inference for CSV data does not work as expected when the data is sparse,
i.e. when most fields are empty.
For instance, consider the following dataset and its inferred schema:

A,B,C,D
1,,,
,1,,
,,1,
,,,1

root
|-- A: integer (nullable = true)
|-- B: integer (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here all four fields should have been inferred as integer, but C and D are
inferred as string.

Another dataset:

A,B,C,D
1,,1,

and the inferred schema:

root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here, fields A and C should have been inferred as integer.
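
To make the expected behaviour concrete, here is a minimal sketch in plain
Python of inference that treats empty fields as "no information" instead of
letting them force a column to string. It is not Spark's or spark-csv's actual
implementation; the type names, the helper functions (infer_cell, merge,
infer_schema) and the fallback to string for columns that are entirely empty
are assumptions made for illustration.

```python
import csv
import io

# Hypothetical type names for this sketch; these are not Spark's DataType classes.
NULL, INTEGER, DOUBLE, STRING = "null", "integer", "double", "string"

# Widening order used when two non-null types disagree.
_RANK = {INTEGER: 0, DOUBLE: 1, STRING: 2}


def infer_cell(value):
    """Return the narrowest type for one CSV field; empty fields carry no type."""
    if value == "":
        return NULL
    try:
        int(value)
        return INTEGER
    except ValueError:
        pass
    try:
        float(value)
        return DOUBLE
    except ValueError:
        return STRING


def merge(a, b):
    """Combine two inferred types; NULL never overrides a real type."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    return a if _RANK[a] >= _RANK[b] else b


def infer_schema(text):
    """Infer one type per column from CSV text whose first row is the header."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, data = rows[0], rows[1:]
    types = [NULL] * len(header)
    for row in data:
        # Ignore any fields beyond the header width.
        for i, cell in enumerate(row[:len(header)]):
            types[i] = merge(types[i], infer_cell(cell))
    # Assumption: a column with no values at all falls back to string.
    return {name: (STRING if t == NULL else t) for name, t in zip(header, types)}


sparse = "A,B,C,D\n1,,,\n,1,,\n,,1,\n,,,1\n"
single = "A,B,C,D\n1,,1,\n"

print(infer_schema(sparse))
print(infer_schema(single))
```

With this merge rule, every column of the first dataset comes out as integer,
and in the second dataset A and C come out as integer while the all-empty
columns B and D stay string.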

The same issue has been discussed for the spark-csv package. Please take a look at
https://github.com/databricks/spark-csv/issues/216 for reference.

In spark-csv, the issue was fixed with
https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d
I will try to submit a PR with the patch soon.



