Re: [sql] Dataframe how to check null values

2015-04-20 Thread Ted Yu
I found:
https://issues.apache.org/jira/browse/SPARK-6573



> On Apr 20, 2015, at 4:29 AM, Peter Rudenko wrote:
> 
> Sounds very good. Is there a jira for this? Would be cool to have in 1.4, 
> because currently you cannot use the dataframe.describe function with NaN 
> values; you need to manually filter all the columns.
> 
> Thanks,
> Peter Rudenko
> 
>> On 2015-04-02 21:18, Reynold Xin wrote:
>> Incidentally, we were discussing this yesterday. Here are some thoughts on 
>> null handling in SQL/DataFrames. Would be great to get some feedback.
>> 
>> 1. Treat floating point NaN and null as the same "null" value. This would be 
>> consistent with most SQL databases, and Pandas. This would also require some 
>> inbound conversion.
>> 
>> 2. Internally, when we see a NaN value, we should mark the null bit as true, 
>> and keep the NaN value. When we see a null value for a floating point field, 
>> we should mark the null bit as true, and update the field to store NaN.
>> 
>> 3. Externally, for floating point values, return NaN when the value is null.
>> 
>> 4. For all other types, return null for null values.
>> 
>> 5. For UDFs, if the argument is primitive type only (i.e. does not handle 
>> null) and not a floating point field, simply evaluate the expression to 
>> null. This is consistent with most SQL UDFs and most programming languages' 
>> treatment of NaN.
>> 
>> 
>> Any thoughts on these semantics?
>> 
>> 
>> On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler wrote:
>> 
>>    I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
>>    Double, Byte, and Boolean look like reference types in source code, but
>>    they are compiled to the corresponding JVM primitive types, which can't
>>    be null. That's why you get the warning about ==.
>> 
>>    Your best choice might be to use NaN as the placeholder for null, then
>>    create one DF using a filter that removes those values. Use that DF to
>>    compute the mean. Then apply a map step to the original DF to translate
>>    the NaN's to the mean.
>> 
>>    dean
>> 
>>    Dean Wampler, Ph.D.
>>    Author: Programming Scala, 2nd Edition (O'Reilly)
>>    Typesafe
>>    @deanwampler
>>    http://polyglotprogramming.com
>> 
>>    On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko wrote:
>> 
>>> Hi, I need to implement a MeanImputor - impute missing values with the
>>> mean. If I set missing values to null, then dataframe aggregation works
>>> properly, but in a UDF it treats null values as 0.0. Here's an example:
>>>
>>> val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
>>> df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
>>> df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()
>>> // d     d2
>>> // 1.0   1.0
>>> // 2.0   2.0
>>> // null  0.0
>>> // 3.0   3.0
>>> // 5.0   5.0
>>> // null  0.0
>>> val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
>>> df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]
>>>
>>> In the UDF I cannot compare Scala's Double to null:
>>>
>>> comparing values of types Double and Null using `==' will always yield false
>>> [warn] if (value == null) meanValue else value
>>>
>>> With Double.NaN instead of null I can compare in the UDF, but the
>>> aggregation doesn't work properly. Maybe it's related to:
>>> https://issues.apache.org/jira/browse/SPARK-6573
>>>
>>> Thanks,
>>> Peter Rudenko
>>>
> 




Re: [sql] Dataframe how to check null values

2015-04-20 Thread Peter Rudenko
Sounds very good. Is there a jira for this? Would be cool to have in 
1.4, because currently you cannot use the dataframe.describe function 
with NaN values; you need to manually filter all the columns.
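
For illustration, here is a rough sketch of that manual per-column filtering
(the DataFrame df and the column names "d1"/"d2" are placeholders):

  import org.apache.spark.sql.functions.callUDF
  import org.apache.spark.sql.types.BooleanType

  // Double columns that may contain NaN (placeholder names).
  val doubleCols = Seq("d1", "d2")

  doubleCols.foreach { c =>
    // Drop the NaN rows for this column only, then summarize just that column.
    val clean = df.filter(callUDF({ (v: Double) => !v.isNaN }, BooleanType, df(c)))
    clean.describe(c).show()
  }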


Thanks,
Peter Rudenko

On 2015-04-02 21:18, Reynold Xin wrote:
Incidentally, we were discussing this yesterday. Here are some 
thoughts on null handling in SQL/DataFrames. Would be great to get 
some feedback.


1. Treat floating point NaN and null as the same "null" value. This 
would be consistent with most SQL databases, and Pandas. This would 
also require some inbound conversion.


2. Internally, when we see a NaN value, we should mark the null bit as 
true, and keep the NaN value. When we see a null value for a floating 
point field, we should mark the null bit as true, and update the field 
to store NaN.


3. Externally, for floating point values, return NaN when the value is 
null.


4. For all other types, return null for null values.

5. For UDFs, if the argument is primitive type only (i.e. does not 
handle null) and not a floating point field, simply evaluate the 
expression to null. This is consistent with most SQL UDFs and most 
programming languages' treatment of NaN.



Any thoughts on these semantics?


On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler wrote:


I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
Double, Byte, and Boolean look like reference types in source code, but
they are compiled to the corresponding JVM primitive types, which can't
be null. That's why you get the warning about ==.

Your best choice might be to use NaN as the placeholder for null, then
create one DF using a filter that removes those values. Use that DF to
compute the mean. Then apply a map step to the original DF to translate
the NaN's to the mean.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko wrote:

> Hi, I need to implement a MeanImputor - impute missing values with the
> mean. If I set missing values to null, then dataframe aggregation works
> properly, but in a UDF it treats null values as 0.0. Here's an example:
>
> val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
> df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
> df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()
> // d     d2
> // 1.0   1.0
> // 2.0   2.0
> // null  0.0
> // 3.0   3.0
> // 5.0   5.0
> // null  0.0
> val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
> df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]
>
> In the UDF I cannot compare Scala's Double to null:
>
> comparing values of types Double and Null using `==' will always yield false
> [warn] if (value == null) meanValue else value
>
> With Double.NaN instead of null I can compare in the UDF, but the
> aggregation doesn't work properly. Maybe it's related to:
> https://issues.apache.org/jira/browse/SPARK-6573
>
> Thanks,
> Peter Rudenko






Re: [sql] Dataframe how to check null values

2015-04-02 Thread Reynold Xin
Incidentally, we were discussing this yesterday. Here are some thoughts on
null handling in SQL/DataFrames. Would be great to get some feedback.

1. Treat floating point NaN and null as the same "null" value. This would
be consistent with most SQL databases, and Pandas. This would also require
some inbound conversion.

2. Internally, when we see a NaN value, we should mark the null bit as
true, and keep the NaN value. When we see a null value for a floating point
field, we should mark the null bit as true, and update the field to store
NaN.

3. Externally, for floating point values, return NaN when the value is null.

4. For all other types, return null for null values.

5. For UDFs, if the argument is primitive type only (i.e. does not handle
null) and not a floating point field, simply evaluate the expression to
null. This is consistent with most SQL UDFs and most programming languages'
treatment of NaN.


Any thoughts on these semantics?
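
For concreteness, a rough sketch in plain Scala (not actual Spark internals) of
how points 2-4 could behave for a single nullable double field, i.e. a null bit
plus a primitive slot:

  final class NullableDoubleField {
    private var isNull: Boolean = true
    private var value: Double = Double.NaN

    // Point 2: writing either null or NaN sets the null bit and stores NaN.
    def write(v: java.lang.Double): Unit =
      if (v == null || v.isNaN) { isNull = true; value = Double.NaN }
      else { isNull = false; value = v }

    // Point 3: reading the field as a floating point value yields NaN when null.
    def readDouble: Double = if (isNull) Double.NaN else value

    // Point 4 analogue: reading as a boxed/reference value yields null when null.
    def readBoxed: java.lang.Double = if (isNull) null else Double.box(value)
  }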


On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler wrote:

> I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
> Double, Byte, and Boolean look like reference types in source code, but
> they are compiled to the corresponding JVM primitive types, which can't be
> null. That's why you get the warning about ==.
>
> Your best choice might be to use NaN as the placeholder for null, then
> create one DF using a filter that removes those values. Use that DF to
> compute the mean. Then apply a map step to the original DF to translate the
> NaN's to the mean.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition (O'Reilly)
> Typesafe
> @deanwampler
> http://polyglotprogramming.com
>
> On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko wrote:
>
> > Hi, I need to implement a MeanImputor - impute missing values with the
> > mean. If I set missing values to null, then dataframe aggregation works
> > properly, but in a UDF it treats null values as 0.0. Here's an example:
> >
> > val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
> > df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
> > df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()
> > // d     d2
> > // 1.0   1.0
> > // 2.0   2.0
> > // null  0.0
> > // 3.0   3.0
> > // 5.0   5.0
> > // null  0.0
> > val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
> > df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]
> >
> > In the UDF I cannot compare Scala's Double to null:
> >
> > comparing values of types Double and Null using `==' will always yield false
> > [warn] if (value == null) meanValue else value
> >
> > With Double.NaN instead of null I can compare in the UDF, but the
> > aggregation doesn't work properly. Maybe it's related to:
> > https://issues.apache.org/jira/browse/SPARK-6573
> >
> > Thanks,
> > Peter Rudenko
> >
>


Re: [sql] Dataframe how to check null values

2015-04-02 Thread Dean Wampler
I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
Double, Byte, and Boolean look like reference types in source code, but
they are compiled to the corresponding JVM primitive types, which can't be
null. That's why you get the warning about ==.
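
A tiny illustration of the point (plain Scala, nothing Spark-specific):

  val v: Double = 3.0

  // Double compiles to the primitive JVM double, which can never be null, so
  // scalac warns that this comparison always yields false:
  if (v == null) println("never reached for a primitive double")

  // NaN, by contrast, is an ordinary double bit pattern, so this check is fine:
  if (v.isNaN) println("missing") else println("present: " + v)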

Your best choice might be to use NaN as the placeholder for null,
then create one DF using a filter that removes those values. Use that DF to
compute the mean. Then apply a map step to the original DF to translate the
NaN's to the mean.
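
In code, something like the following sketch (the column name "d" and the helper
name imputeMean are made up here; it reuses the callUDF form from your example and
assumes the missing values are already stored as Double.NaN):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{avg, callUDF}
  import org.apache.spark.sql.types.{BooleanType, DoubleType}

  def imputeMean(df: DataFrame): DataFrame = {
    // 1. Keep only the rows where "d" holds a real value (drop the NaN placeholders).
    val present = df.filter(callUDF({ (v: Double) => !v.isNaN }, BooleanType, df("d")))

    // 2. Compute the mean over those rows (assumes at least one non-NaN value).
    val mean = present.agg(avg("d")).first.getDouble(0)

    // 3. Rewrite the column in the original DF, replacing NaN with the mean.
    df.withColumn("d_imputed",
      callUDF({ (v: Double) => if (v.isNaN) mean else v }, DoubleType, df("d")))
  }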

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko wrote:

> Hi, I need to implement a MeanImputor - impute missing values with the
> mean. If I set missing values to null, then dataframe aggregation works
> properly, but in a UDF it treats null values as 0.0. Here's an example:
>
> val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
> df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
> df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()
> // d     d2
> // 1.0   1.0
> // 2.0   2.0
> // null  0.0
> // 3.0   3.0
> // 5.0   5.0
> // null  0.0
> val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
> df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]
>
> In the UDF I cannot compare Scala's Double to null:
>
> comparing values of types Double and Null using `==' will always yield false
> [warn] if (value == null) meanValue else value
>
> With Double.NaN instead of null I can compare in the UDF, but the
> aggregation doesn't work properly. Maybe it's related to:
> https://issues.apache.org/jira/browse/SPARK-6573
>
> Thanks,
> Peter Rudenko
>


[sql] Dataframe how to check null values

2015-04-02 Thread Peter Rudenko
Hi, I need to implement a MeanImputor - impute missing values with the 
mean. If I set missing values to null, then dataframe aggregation works 
properly, but in a UDF it treats null values as 0.0. Here's an example:


val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
df.agg(avg("_1")).first  // res45: org.apache.spark.sql.Row = [2.75]
df.withColumn("d2", callUDF({ (value: Double) => value }, DoubleType, df("d"))).show()
// d     d2
// 1.0   1.0
// 2.0   2.0
// null  0.0
// 3.0   3.0
// 5.0   5.0
// null  0.0
val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
df.agg(avg("_1")).first  // res46: org.apache.spark.sql.Row = [Double.NaN]


In the UDF I cannot compare Scala's Double to null:

comparing values of types Double and Null using `==' will always yield false
[warn] if (value == null) meanValue else value


With Double.NaN instead of null I can compare in the UDF, but the 
aggregation doesn't work properly. Maybe it's related to: 
https://issues.apache.org/jira/browse/SPARK-6573


Thanks,
Peter Rudenko
