[
https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin updated SPARK-18269:
--------------------------------
Description:
Reading a CSV file against a schema with a nullable column throws a
java.lang.NumberFormatException: null when both the value and its delimiter
are missing from the CSV line.
Specifying the schema:
{code}
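import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val sourceSchema = StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, nullable = true)
))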
{code}
Data (without a trailing delimiter to mark the second column):
{code}
1
{code}
Read the data:
{code}
sparkSession.read
  .schema(sourceSchema)
  .option("header", "false")
  .option("delimiter", """\t""")
  .csv(files(dates): _*)
  .rdd
{code}
Actual Result:
{code}
java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:542)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}
Reason:
The CSV line is parsed into indexSafeTokens, which ends up one value short of
the schema. Reading the optional value through indexSafeTokens(index)
therefore does not produce a string; judging by the stack trace, it produces
null.
That null is then passed to CSVTypeCast.castTo(datum: String, .....) as the
datum value.
The subsequent NumberFormatException is thrown because a null datum cannot be
cast to the target type: the trace ends in toInt and Integer.parseInt.
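The last step of that chain can be reproduced without Spark. The following is a minimal sketch, assuming the missing column reaches the cast as a null String; `NullDatumDemo` and `parse` are illustrative names, not Spark code.

```scala
import scala.util.Try

object NullDatumDemo {
  // Mirrors the failing frames in the trace: toInt called on the datum.
  def parse(datum: String): Try[Int] = Try(datum.toInt)

  def main(args: Array[String]): Unit = {
    // Assumption: stands in for the absent second column.
    val missing: String = null
    parse(missing).failed.foreach { e =>
      // Prints the exception class and whatever message the JVM supplies.
      println(s"${e.getClass.getName}: ${e.getMessage}")
    }
  }
}
```

The exception type is a NumberFormatException rather than a NullPointerException because Integer.parseInt checks for null explicitly before parsing.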
Possible fix:
- Use the provided schema to parse the line with the correct number of columns
- Since the column is nullable, wrap the indexSafeTokens(index) lookup in
CSVRelation.csvParser in a try/catch
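Both ideas can be sketched in plain Scala, using a null check in place of the try/catch. This is only an illustration under stated assumptions, not the Spark implementation: `padTokens` and `castInt` are hypothetical helpers, and the nullable flags are hard-coded to match the schema above.

```scala
object NullSafeCastSketch {
  // Hypothetical helper: pad a short token array with nulls so that it
  // has one entry per schema column.
  def padTokens(tokens: Array[String], width: Int): Array[String] =
    if (tokens.length >= width) tokens
    else tokens ++ Array.fill[String](width - tokens.length)(null)

  // Hypothetical helper: a missing value in a nullable column becomes None
  // instead of being handed to toInt.
  def castInt(datum: String, nullable: Boolean): Option[Int] =
    if (datum == null && nullable) None
    else Some(datum.toInt) // still throws for null in a non-nullable column

  def main(args: Array[String]): Unit = {
    // Assumption: nullable flags for the columns id and underlyingId.
    val nullableFlags = Array(false, true)
    // The sample line "1" has no tab, so splitting yields a single token.
    val tokens = padTokens("1".split("\t"), nullableFlags.length)
    val row = tokens.zip(nullableFlags).map { case (t, n) => castInt(t, n) }
    println(row.mkString(", "))
  }
}
```

For the sample line this yields Some(1) for id and None for underlyingId, while a missing value in a non-nullable column would still fail loudly.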
> NumberFormatException when reading csv for a nullable column
> ------------------------------------------------------------
>
> Key: SPARK-18269
> URL: https://issues.apache.org/jira/browse/SPARK-18269
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Jork Zijlstra