[
https://issues.apache.org/jira/browse/SPARK-21263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069911#comment-16069911
]
Sean Owen commented on SPARK-21263:
-----------------------------------
CC [~falaki] as well for the original code
Yeah, tough one. The original code is trying to handle Locale, as I expected.
The Spark version does not, because (for other good reasons) it is not
sensitive to the machine's locale.
I think the right behavior is therefore to fail on this type of input. IMHO
it's more a fix than a behavior change, because silently getting "10" out of
"10u000" doesn't sound like a good idea.
We could use {{.toDouble}}. We could also keep the current code but check
whether it consumed all the input by inspecting the {{ParsePosition}}
afterwards. I note that, for example, the current code would parse "10e3" as
"10", whereas {{.toDouble}} would parse it as 10000.0. So using the latter does
introduce small behavior changes, but again, it seems less surprising to parse
that correctly as scientific notation, like standard JVM parsing routines would?
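To make the two options concrete, here is a minimal sketch in plain Java. It is not Spark's actual code: the class and method names ({{StrictNumberParsing}}, {{parseLenient}}, {{parseStrict}}) are hypothetical, and the use of {{Locale.US}} is an assumption made only so the example does not depend on the machine's locale.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper class for illustration only; not part of Spark.
class StrictNumberParsing {

    // Lenient parse: NumberFormat stops at the first character it cannot
    // handle and returns whatever it parsed so far (null if nothing parsed).
    static Number parseLenient(String s) {
        NumberFormat format = NumberFormat.getInstance(Locale.US); // assumed locale
        return format.parse(s, new ParsePosition(0));
    }

    // Strict parse: same NumberFormat, but reject the input unless the
    // ParsePosition shows that every character was consumed.
    static double parseStrict(String s) {
        NumberFormat format = NumberFormat.getInstance(Locale.US); // assumed locale
        ParsePosition pos = new ParsePosition(0);
        Number parsed = format.parse(s, pos);
        if (parsed == null || pos.getIndex() != s.length()) {
            throw new NumberFormatException("For input string: \"" + s + "\"");
        }
        return parsed.doubleValue();
    }

    public static void main(String[] args) {
        // Lenient parsing silently drops the trailing "u000".
        System.out.println(parseLenient("10u000")); // prints 10

        // Strict parsing rejects the same input.
        try {
            parseStrict("10u000");
        } catch (NumberFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // The standard JVM routine also rejects it, and reads "10e3"
        // as scientific notation.
        System.out.println(Double.parseDouble("10e3")); // prints 10000.0
    }
}
```

The {{ParsePosition}} check keeps locale-aware parsing (grouping separators and so on) while still failing on trailing garbage; {{Double.parseDouble}}, which is what Scala's {{.toDouble}} delegates to, is simpler but follows the JVM's own number grammar instead.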
> NumberFormatException is not thrown while converting an invalid string to
> float/double
> --------------------------------------------------------------------------------------
>
> Key: SPARK-21263
> URL: https://issues.apache.org/jira/browse/SPARK-21263
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.1.1
> Reporter: Navya Krishnappa
>
> When reading the data below with a user-defined schema, no exception is
> thrown for an invalid value. Details:
> *Data:*
> 'PatientID','PatientName','TotalBill'
> '1000','Patient1','10u000'
> '1001','Patient2','30000'
> '1002','Patient3','40000'
> '1003','Patient4','50000'
> '1004','Patient5','60000'
> *Source code*:
> Dataset dataset = sparkSession.read().schema(schema)
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> When we collect the dataset data:
> dataset.collectAsList();
> *Schema1*:
> [StructField(PatientID,IntegerType,true),
> StructField(PatientName,StringType,true),
> StructField(TotalBill,IntegerType,true)]
> *Result*: Throws NumberFormatException
> Caused by: java.lang.NumberFormatException: For input string: "10u000"
> *Schema2*:
> [StructField(PatientID,IntegerType,true),
> StructField(PatientName,StringType,true),
> StructField(TotalBill,DoubleType,true)]
> *Actual Result*:
> "PatientID": 1000,
> "NumberOfVisits": "400",
> "TotalBill": 10,
> *Expected Result*: Should throw NumberFormatException for input string
> "10u000"