Shivam Dalmia created SPARK-19488:
-------------------------------------
Summary: CSV infer schema does not take into account Inf,-Inf,NaN
Key: SPARK-19488
URL: https://issues.apache.org/jira/browse/SPARK-19488
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.2
Environment: Windows 10, SparkShell
Reporter: Shivam Dalmia
I observed that while loading a CSV as a DataFrame, user-specified values for
nanValue, positiveInf and negativeInf are disregarded when inferSchema = true.
(They do work if a user-specified schema is provided.) However, even the Spark
defaults for the infinities (Inf and -Inf) do not work with inferSchema.
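For example, a read like the following (a minimal sketch; the file name and
custom tokens are placeholders, and spark is the usual shell session) does not
honour the custom tokens during inference:
{code}
// Sketch: the "value" column of numbers.csv contains the custom token "inf".
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nanValue", "nan")
  .option("positiveInf", "inf")
  .option("negativeInf", "-inf")
  .csv("numbers.csv")

// Expected: value inferred as double. Actual: the custom tokens are ignored
// during inference, so the column falls back to string.
df.printSchema()
{code}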
Taking a look at the source of the CSV schema inference
(CSVInferSchema.scala), I found the following snippet:
{code}
1.  private def tryParseDouble(field: String, options: CSVOptions): DataType = {
2.    if ((allCatch opt field.toDouble).isDefined) {
3.      DoubleType
4.    } else {
5.      tryParseTimestamp(field, options)
6.    }
7.  }
8.
9.  private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
10.   // This case infers a custom `dataFormat` is set.
11.   if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
12.     TimestampType
13.   } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
14.     // We keep this for backwards compatibility.
15.     TimestampType
16.   } else {
17.     tryParseBoolean(field, options)
18.   }
19. }
{code}
Interestingly, the user-specified CSV options are not used at all when
determining whether the field is a double (line 2), whereas options is
consulted for the timestamp type (line 11), which is why the 'timestampFormat'
option does work.
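A possible fix (just a sketch, untested against the actual codebase) would be
to consult the user-specified tokens in tryParseDouble as well, since
CSVOptions already exposes nanValue, positiveInf and negativeInf:
{code}
// Sketch of a fix: accept the user-specified tokens as doubles before
// falling back to Scala's toDouble.
private def tryParseDouble(field: String, options: CSVOptions): DataType = {
  val isDouble =
    field == options.nanValue ||
    field == options.positiveInf ||
    field == options.negativeInf ||
    (allCatch opt field.toDouble).isDefined
  if (isDouble) DoubleType else tryParseTimestamp(field, options)
}
{code}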
Even without such a fix, the NaN case happens to work, because Scala's
toDouble does convert the string "NaN" to the double equivalent of NaN (I
tried it in the shell):
{code}
scala> import scala.util.control.Exception.allCatch
import scala.util.control.Exception.allCatch
scala> var field = "8.0942";
field: String = 8.0942
scala> allCatch.opt(field.toDouble)
res12: Option[Double] = Some(8.0942)
scala> var field = "NaN";
field: String = NaN
scala> allCatch.opt(field.toDouble)
res13: Option[Double] = Some(NaN)
scala> var field = "Inf";
field: String = Inf
scala> allCatch.opt(field.toDouble)
res14: Option[Double] = None
{code}
Interestingly, Scala does have Double equivalents of Infinity and -Infinity,
but the Spark defaults are Inf and -Inf, which is why they don't work:
{code}
scala> field = "Infinity";
field: String = Infinity
scala> allCatch.opt(field.toDouble)
res15: Option[Double] = Some(Infinity)
scala> field = "-Infinity";
field: String = -Infinity
scala> allCatch.opt(field.toDouble)
res16: Option[Double] = Some(-Infinity)
{code}
When the following CSV is ingested with inferSchema = true, Spark therefore
infers the value column as a Double, regardless of any user-specified options:
{code}
ID,name,value,irrational,prime,real
1,e,2.7,true,false,true
2,pi,3.14,true,false,true
3,inf,Infinity,false,false,true
4,-inf,-Infinity,false,false,true
5,i,NaN,false,false,false
{code}
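For reference, the inferred schema should look roughly like this (value comes
back as double only because toDouble happens to accept "Infinity", "-Infinity"
and "NaN", not because of any option):
{code}
root
 |-- ID: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- value: double (nullable = true)
 |-- irrational: boolean (nullable = true)
 |-- prime: boolean (nullable = true)
 |-- real: boolean (nullable = true)
{code}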