Rahul Tanwani created SPARK-13309:
-------------------------------------

             Summary: Incorrect type inference for CSV data.
                 Key: SPARK-13309
                 URL: https://issues.apache.org/jira/browse/SPARK-13309
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Rahul Tanwani
             Fix For: 1.6.0

Type inference for CSV data does not work as expected when the data is sparse.

For instance, consider the following dataset:

    A,B,C,D
    1,,,
    ,1,,
    ,,1,
    ,,,1

and its inferred schema:

    root
     |-- A: integer (nullable = true)
     |-- B: integer (nullable = true)
     |-- C: string (nullable = true)
     |-- D: string (nullable = true)

Here all four fields should have been inferred as integer, but C and D are inferred as string.

Another dataset:

    A,B,C,D
    1,,1,

and its inferred schema:

    root
     |-- A: string (nullable = true)
     |-- B: string (nullable = true)
     |-- C: string (nullable = true)
     |-- D: string (nullable = true)

Here, fields A and C should be inferred as integer.

The same issue has been discussed for the spark-csv package. Please take a look at https://github.com/databricks/spark-csv/issues/216 for reference. It was fixed there with https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d .

I will try to submit a PR with the patch soon.
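
For illustration, the behaviour this report expects can be sketched in plain Python. This is not Spark's or spark-csv's code: the function names (infer_field, merge, infer_schema) and the small type ordering are hypothetical, chosen only to show the idea that an empty field carries no type evidence and so should not widen a column to string. The fallback to string for a column with no values at all is also an assumption of this sketch.

```python
import csv
import io

# Ordering used by this sketch only: None (no evidence yet) is the narrowest,
# "string" the widest. This is not Spark's actual type hierarchy.
_RANK = {None: 0, "integer": 1, "double": 2, "string": 3}


def infer_field(value):
    """Infer a type name for one raw CSV field; empty fields give no evidence."""
    if value is None or value.strip() == "":
        return None
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def merge(a, b):
    """Return the wider of two inferred types; None never widens anything."""
    return a if _RANK[a] >= _RANK[b] else b


def infer_schema(text):
    """Infer one type per column from CSV text whose first row is the header."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    types = [None] * len(header)
    for row in rows:
        # Pad short rows so every column sees a (possibly empty) field.
        row = row + [""] * (len(header) - len(row))
        for i in range(len(header)):
            types[i] = merge(types[i], infer_field(row[i]))
    # Assumption: a column with no values at all falls back to string.
    return {name: (t or "string") for name, t in zip(header, types)}


sparse = "A,B,C,D\n1,,,\n,1,,\n,,1,\n,,,1\n"
single = "A,B,C,D\n1,,1,\n"
print(infer_schema(sparse))
print(infer_schema(single))
```

With the two datasets from this report, the sketch infers integer for every column of the first and integer for A and C of the second, which is the outcome the report asks for.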