I found: https://issues.apache.org/jira/browse/SPARK-6573
> On Apr 20, 2015, at 4:29 AM, Peter Rudenko <petro.rude...@gmail.com> wrote:
>
> Sounds very good. Is there a JIRA for this? It would be cool to have in 1.4,
> because currently dataframe.describe cannot be used with NaN values; all
> the columns need to be filtered manually.
>
> Thanks,
> Peter Rudenko
>
>> On 2015-04-02 21:18, Reynold Xin wrote:
>> Incidentally, we were discussing this yesterday. Here are some thoughts
>> on null handling in SQL/DataFrames. It would be great to get some feedback.
>>
>> 1. Treat floating point NaN and null as the same "null" value. This would
>> be consistent with most SQL databases, and with Pandas. It would also
>> require some inbound conversion.
>>
>> 2. Internally, when we see a NaN value, we should mark the null bit as
>> true and keep the NaN value. When we see a null value for a floating
>> point field, we should mark the null bit as true and update the field to
>> store NaN.
>>
>> 3. Externally, for floating point values, return NaN when the value is
>> null.
>>
>> 4. For all other types, return null for null values.
>>
>> 5. For UDFs, if the argument is a primitive type only (i.e. does not
>> handle null) and not a floating point field, simply evaluate the
>> expression to null. This is consistent with most SQL UDFs and most
>> programming languages' treatment of NaN.
>>
>> Any thoughts on these semantics?
>>
>> On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
>>
>> I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
>> Double, Byte, and Boolean look like reference types in source code, but
>> they are compiled to the corresponding JVM primitive types, which can't
>> be null. That's why you get the warning about ==.
>>
>> It might be that your best choice is to use NaN as the placeholder for
>> null, then create one DF using a filter that removes those values. Use
>> that DF to compute the mean.
>> Then apply a map step to the original DF to translate the NaN's to the
>> mean.
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko <petro.rude...@gmail.com> wrote:
>>
>> > Hi, I need to implement a MeanImputor - impute missing values with the
>> > mean. If I set missing values to null, then dataframe aggregation works
>> > properly, but a UDF treats null values as 0.0. Here's an example:
>> >
>> > val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
>> > df.agg(avg("_1")).first
>> > //res45: org.apache.spark.sql.Row = [2.75]
>> > df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
>> > d    d2
>> > 1.0  1.0
>> > 2.0  2.0
>> > null 0.0
>> > 3.0  3.0
>> > 5.0  5.0
>> > null 0.0
>> > val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
>> > df.agg(avg("_1")).first
>> > //res46: org.apache.spark.sql.Row = [Double.NaN]
>> >
>> > In a UDF I cannot compare Scala's Double to null:
>> >
>> > comparing values of types Double and Null using `==' will always yield
>> > false
>> > [warn] if (value == null) meanValue else value
>> >
>> > With Double.NaN instead of null I can compare in a UDF, but the
>> > aggregation doesn't work properly. Maybe it's related to:
>> > https://issues.apache.org/jira/browse/SPARK-6573
>> >
>> > Thanks,
>> > Peter Rudenko

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
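[Editor's note] The two behaviors discussed in the thread can be reproduced in plain Scala, without Spark. This is a minimal sketch, with a simple Array standing in for the DataFrame column: it shows why a null reaching a Double-typed UDF argument surfaces as 0.0 (JVM primitive unboxing yields the default value), and how Dean's workaround (filter out NaN, compute the mean, then map) imputes the mean.

```scala
// Why Peter's UDF sees 0.0: unboxing a null into a JVM primitive Double
// yields the primitive's default value (0.0), not null.
val unboxed: Double = null.asInstanceOf[Double]
// unboxed == 0.0

// Dean's workaround, sketched on a plain Array standing in for the column:
// use NaN as the missing-value placeholder, filter it out to compute the
// mean, then map over the original values, replacing NaN with that mean.
val xs = Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)
val clean = xs.filterNot(_.isNaN)
val mean = clean.sum / clean.length // (1 + 2 + 3 + 5) / 4 = 2.75
val imputed = xs.map(x => if (x.isNaN) mean else x)
// imputed: Array(1.0, 2.0, 2.75, 3.0, 5.0, 2.75)
```

Note that `x.isNaN` is the only safe test here: `Double.NaN == Double.NaN` is false under IEEE 754 semantics, so an equality check against a NaN placeholder would never match.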
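[Editor's note] Reynold's points 1-4 can also be sketched in plain Scala, without Spark's internals. `FloatField`, `store`, and `read` are hypothetical names used only to illustrate the null-bit scheme, not Spark's actual representation.

```scala
// Hypothetical internal representation: a null bit plus the raw Double slot.
final case class FloatField(isNull: Boolean, value: Double)

// Inbound conversion (points 1-2): both null and NaN set the null bit,
// and the stored value is canonicalized to NaN.
def store(in: java.lang.Double): FloatField =
  if (in == null || in.isNaN) FloatField(isNull = true, value = Double.NaN)
  else FloatField(isNull = false, value = in)

// Outbound (point 3): a null floating point field is surfaced as NaN.
def read(f: FloatField): Double =
  if (f.isNull) Double.NaN else f.value
```

Under this scheme, user code would only ever see NaN for a missing floating point value, which it can test with `isNaN`, while aggregates could consult the null bit to skip missing values.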