[ https://issues.apache.org/jira/browse/SPARK-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-18269:
--------------------------------
    Description: 
Having a schema with a nullable column throws a java.lang.NumberFormatException: null when the value and its delimiter are not present in the CSV line.

Specifying the schema:

{code}
StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("underlyingId", IntegerType, true)
))
{code}

Data (without a trailing delimiter, so the second column is not specified):
{code}
1
{code}

Read the data:
{code}
sparkSession.read
    .schema(sourceSchema)
    .option("header", "false")
    .option("delimiter", """\t""")
    .csv(files(dates): _*)
    .rdd
{code}

Actual Result: 
{code}
java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:542)
        at java.lang.Integer.parseInt(Integer.java:615)
        at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
        at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
{code}

Reason:
The CSV line is parsed into a map of tokens (indexSafeTokens) that is short of one value, so indexSafeTokens(index) yields null when reading the optional value that isn't in the map.

That null is then passed to CSVTypeCast.castTo(datum: String, ...) as the datum value. The subsequent NumberFormatException is thrown because a null datum cannot be parsed into the target type (Integer.parseInt(null)).
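
The failure mode can be seen in isolation, which also explains why the exception is a NumberFormatException rather than a NullPointerException:

{code}
// Plain Scala, no Spark needed: toInt on a null String reaches
// Integer.parseInt(null), which throws NumberFormatException: null
// rather than a NullPointerException, matching the trace above.
val datum: String = null // stands in for the missing token
datum.toInt              // java.lang.NumberFormatException: null
{code}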

Possible fix:
- Use the provided schema to parse the line with the correct number of columns.
- Since the column is nullable, guard the indexSafeTokens(index) lookup in CSVRelation.csvParser with a null check or try/catch, as in the sketch below.
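
A minimal sketch of the second suggestion (the helper name and signature are hypothetical, not the actual CSVRelation code):

{code}
// Hypothetical guard around the token lookup: a missing token for a
// nullable field becomes null, which castTo can map to a null column
// value, instead of flowing into Integer.parseInt.
def tokenOrNull(tokens: Array[String], index: Int, nullable: Boolean): String = {
  val token = if (index < tokens.length) tokens(index) else null
  if (token == null && !nullable) {
    throw new RuntimeException(s"Missing value for non-nullable field at index $index")
  }
  token
}
{code}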


> NumberFormatException when reading csv for a nullable column
> ------------------------------------------------------------
>
>                 Key: SPARK-18269
>                 URL: https://issues.apache.org/jira/browse/SPARK-18269
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Jork Zijlstra
>


