[ https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-18269:
--------------------------------
    Description:
Reading a csv with a schema that has a nullable column throws a java.lang.NumberFormatException: null when the data + delimiter for that column isn't specified in the csv.

Specifying the schema:
{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Data (without a trailing delimiter to specify the second column):
{code}
1
{code}

Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}

Actual Result:
{code}
java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:542)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

Reason:
The csv line is parsed into a Map (indexSafeTokens), which is short of one value. So indexSafeTokens(index) throws a NullPointerException when reading the optional value, which isn't in the Map.
The NullPointerException is then given to CSVTypeCast.castTo(datum: String, .....) as the datum value.
The subsequent NumberFormatException is thrown because a NullPointerException cannot be cast into the Type.

Possible fix:
- Use the provided schema to parse the line with the correct number of columns
- Since it's nullable, implement a try/catch on CSVRelation.csvParser indexSafeTokens(index)

> NumberFormatException when reading csv for a nullable column
> ------------------------------------------------------------
>
>                 Key: SPARK-18269
>                 URL: https://issues.apache.org/jira/browse/SPARK-18269
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Jork Zijlstra
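
As a rough illustration of the possible fixes listed in the description, the following Scala sketch pads a short row up to the schema width and treats a missing token as null for a nullable column instead of calling toInt on it. It is not Spark's code: Field, padTokens and castInt are hypothetical names invented for this example, and it assumes the missing token reaches the cast as a null String.

```scala
// Hypothetical stand-in for a schema column; this is NOT Spark's StructField.
final case class Field(name: String, nullable: Boolean)

object NullSafeCsvCast {
  // Pads a short token array with nulls so that every schema index is addressable.
  def padTokens(tokens: Array[String], width: Int): Array[String] =
    if (tokens.length >= width) tokens
    else tokens ++ Array.fill[String](width - tokens.length)(null)

  // Casts one token to an Int. A missing (null) or empty token becomes None
  // for a nullable column and is rejected for a non-nullable one, so toInt
  // is never called on null.
  def castInt(datum: String, field: Field): Option[Int] =
    if (datum == null || datum.isEmpty) {
      if (field.nullable) None
      else throw new IllegalArgumentException(
        s"Missing value for non-nullable column ${field.name}")
    } else {
      Some(datum.toInt)
    }
}
```

With the data from the report, NullSafeCsvCast.padTokens("1".split("\t"), 2) gives a two-element row whose second entry is null, and castInt on that entry with Field("underlyingId", nullable = true) returns None rather than throwing. Whether a missing column should be read as null or rejected as a malformed row is a design decision this sketch does not settle.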