I found: https://issues.apache.org/jira/browse/SPARK-6573
> On Apr 20, 2015, at 4:29 AM, Peter Rudenko <petro.rude...@gmail.com> wrote:
>
> Sounds very good. Is there a JIRA for this? It would be cool to have in 1.4,
> because currently dataframe.describe cannot be used with NaN values; all
> the columns need to be filtered manually.
>
> Thanks,
> Peter Rudenko
>
>> On 2015-04-02 21:18, Reynold Xin wrote:
>> Incidentally, we were discussing this yesterday. Here are some thoughts
>> on null handling in SQL/DataFrames. It would be great to get some feedback.
>>
>> 1. Treat floating point NaN and null as the same "null" value. This would
>> be consistent with most SQL databases, and with Pandas. It would also
>> require some inbound conversion.
>>
>> 2. Internally, when we see a NaN value, we should mark the null bit as
>> true and keep the NaN value. When we see a null value for a floating
>> point field, we should mark the null bit as true and update the field to
>> store NaN.
>>
>> 3. Externally, for floating point values, return NaN when the value is
>> null.
>>
>> 4. For all other types, return null for null values.
>>
>> 5. For UDFs, if the argument is a primitive type only (i.e. does not
>> handle null) and not a floating point field, simply evaluate the
>> expression to null. This is consistent with most SQL UDFs and most
>> programming languages' treatment of NaN.
>>
>> Any thoughts on these semantics?
>>
>> On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
>>
>> I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
>> Double, Byte, and Boolean look like reference types in source code, but
>> they are compiled to the corresponding JVM primitive types, which can't
>> be null. That's why you get the warning about ==.
>>
>> It might be that your best choice is to use NaN as the placeholder for
>> null, then create one DF using a filter that removes those values. Use
>> that DF to compute the mean.
>> Then apply a map step to the original DF to translate the NaN's to the
>> mean.
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko <petro.rude...@gmail.com> wrote:
>>
>> > Hi, I need to implement a MeanImputor - impute missing values with the
>> > mean. If I set missing values to null, then dataframe aggregation works
>> > properly, but a UDF treats null values as 0.0. Here's an example:
>> >
>> > val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
>> > df.agg(avg("_1")).first
>> > //res45: org.apache.spark.sql.Row = [2.75]
>> > df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
>> > d    d2
>> > 1.0  1.0
>> > 2.0  2.0
>> > null 0.0
>> > 3.0  3.0
>> > 5.0  5.0
>> > null 0.0
>> > val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
>> > df.agg(avg("_1")).first
>> > //res46: org.apache.spark.sql.Row = [Double.NaN]
>> >
>> > In a UDF I cannot compare Scala's Double to null:
>> >
>> > comparing values of types Double and Null using `==' will always yield
>> > false
>> > [warn] if (value == null) meanValue else value
>> >
>> > With Double.NaN instead of null I can compare in a UDF, but the
>> > aggregation doesn't work properly. Maybe it's related to:
>> > https://issues.apache.org/jira/browse/SPARK-6573
>> >
>> > Thanks,
>> > Peter Rudenko

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
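[Editor's note] The two behaviors discussed in the thread can be reproduced in plain Scala, without Spark. This is a minimal sketch, with a simple Array standing in for the DataFrame column: it shows why a null reaching a Double-typed UDF argument surfaces as 0.0 (JVM primitive unboxing yields the default value), and how Dean's workaround (filter out NaN, compute the mean, then map) imputes the mean.

```scala
// Why Peter's UDF sees 0.0: unboxing a null into a JVM primitive Double
// yields the primitive's default value (0.0), not null.
val unboxed: Double = null.asInstanceOf[Double]
// unboxed == 0.0

// Dean's workaround, sketched on a plain Array standing in for the column:
// use NaN as the missing-value placeholder, filter it out to compute the
// mean, then map over the original values, replacing NaN with that mean.
val xs = Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)
val clean = xs.filterNot(_.isNaN)
val mean = clean.sum / clean.length // (1 + 2 + 3 + 5) / 4 = 2.75
val imputed = xs.map(x => if (x.isNaN) mean else x)
// imputed: Array(1.0, 2.0, 2.75, 3.0, 5.0, 2.75)
```

Note that `x.isNaN` is the only safe test here: `Double.NaN == Double.NaN` is false under IEEE 754 semantics, so an equality check against a NaN placeholder would never match.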
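[Editor's note] Reynold's points 1-4 can also be sketched in plain Scala, without Spark's internals. `FloatField`, `store`, and `read` are hypothetical names used only to illustrate the null-bit scheme, not Spark's actual representation.

```scala
// Hypothetical internal representation: a null bit plus the raw Double slot.
final case class FloatField(isNull: Boolean, value: Double)

// Inbound conversion (points 1-2): both null and NaN set the null bit,
// and the stored value is canonicalized to NaN.
def store(in: java.lang.Double): FloatField =
  if (in == null || in.isNaN) FloatField(isNull = true, value = Double.NaN)
  else FloatField(isNull = false, value = in)

// Outbound (point 3): a null floating point field is surfaced as NaN.
def read(f: FloatField): Double =
  if (f.isNull) Double.NaN else f.value
```

Under this scheme, user code would only ever see NaN for a missing floating point value, which it can test with `isNaN`, while aggregates could consult the null bit to skip missing values.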