Shivam Dalmia created SPARK-19488:
-------------------------------------

             Summary: CSV infer schema does not take into account Inf,-Inf,NaN
                 Key: SPARK-19488
                 URL: https://issues.apache.org/jira/browse/SPARK-19488
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
         Environment: Windows 10, SparkShell
            Reporter: Shivam Dalmia


I observed that while loading a CSV as a dataframe, user-specified values for 
nanValue, positiveInf and negativeInf are disregarded when inferSchema = true. 
(They do work if a user-specified schema is provided.) However, even the Spark 
defaults for the infinities (Inf and -Inf) do not work with inferSchema.

Taking a look at the source code for CSV schema inference 
(CSVInferSchema.scala), I found the following snippet:
{code}
1.  private def tryParseDouble(field: String, options: CSVOptions): DataType = {
2.    if ((allCatch opt field.toDouble).isDefined) {
3.      DoubleType
4.    } else {
5.      tryParseTimestamp(field, options)
6.    }
7.  }
8.
9.  private def tryParseTimestamp(field: String, options: CSVOptions): DataType = {
10.   // This case infers a custom `dataFormat` is set.
11.   if ((allCatch opt options.timestampFormat.parse(field)).isDefined) {
12.     TimestampType
13.   } else if ((allCatch opt DateTimeUtils.stringToTime(field)).isDefined) {
14.     // We keep this for backwards compatibility.
15.     TimestampType
16.   } else {
17.     tryParseBoolean(field, options)
18.   }
19. }
{code}
Interestingly, the user-specified CSV options are not used at all when 
determining whether a field is of type double (as we can see in line 2). The 
options are consulted for the timestamp type (line 11), which is why the 
'dateFormat' option does work.
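One possible direction (a hedged sketch only, not the actual Spark fix; `CsvOpts` below is a hypothetical stand-in for the relevant `CSVOptions` fields) would be for tryParseDouble to check the user-specified tokens before falling back to toDouble:

```scala
import scala.util.control.Exception.allCatch

// Hypothetical stand-in for the nanValue/positiveInf/negativeInf CSV options.
case class CsvOpts(nanValue: String = "NaN",
                   positiveInf: String = "Inf",
                   negativeInf: String = "-Inf")

// Sketch of an options-aware double check: accept the user's special tokens
// first, then fall back to Scala's toDouble.
def looksLikeDouble(field: String, options: CsvOpts): Boolean =
  field == options.nanValue ||
  field == options.positiveInf ||
  field == options.negativeInf ||
  (allCatch opt field.toDouble).isDefined

println(looksLikeDouble("Inf", CsvOpts()))                    // default Inf now accepted
println(looksLikeDouble("inf", CsvOpts(positiveInf = "inf"))) // custom token accepted
println(looksLikeDouble("abc", CsvOpts()))                    // still rejected
```

With a check along these lines, the defaults Inf/-Inf (and any user-specified tokens) would infer as DoubleType instead of falling through to the timestamp and boolean parsers.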
However, when the field is NaN, it works anyway, because Scala's toDouble 
function does convert the string "NaN" to the double equivalent of NaN (I 
tried it in the shell):

{code}
scala> var field = "8.0942";
field: String = 8.0942

scala> allCatch.opt(field.toDouble)
res12: Option[Double] = Some(8.0942)

scala> var field = "NaN";
field: String = NaN

scala> allCatch.opt(field.toDouble)
res13: Option[Double] = Some(NaN)

scala> var field = "Inf";
field: String = Inf

scala> allCatch.opt(field.toDouble)
res14: Option[Double] = None
{code}
Interestingly, Scala does parse the strings "Infinity" and "-Infinity" to 
Double (but the Spark defaults are Inf and -Inf, which is why they don't work):

{code}
scala> field = "Infinity";
field: String = Infinity

scala> allCatch.opt(field.toDouble)
res15: Option[Double] = Some(Infinity)

scala> field = "-Infinity";
field: String = -Infinity

scala> allCatch.opt(field.toDouble)
res16: Option[Double] = Some(-Infinity)
{code}

The following CSV, when ingested with inferSchema = true, therefore has its 
value column inferred as Double, regardless of the user-specified options:

{code}
ID,name,value,irrational,prime,real
1,e,2.7,true,false,true
2,pi,3.14,true,false,true
3,inf,Infinity,false,false,true
4,-inf,-Infinity,false,false,true
5,i,NaN,false,false,false

{code}
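Since tryParseDouble only asks whether toDouble succeeds, every entry in the value column above passes that check, which is what drives the Double inference. A quick plain-Scala check of that reasoning (not Spark itself):

```scala
import scala.util.control.Exception.allCatch

// The value column from the CSV above.
val valueColumn = Seq("2.7", "3.14", "Infinity", "-Infinity", "NaN")

// The same test tryParseDouble applies: does toDouble succeed for every field?
val allDouble = valueColumn.forall(f => (allCatch opt f.toDouble).isDefined)

println(allDouble) // true, so the whole column infers as DoubleType
```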





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
