Re: Confusing argument of sql.functions.count

2016-06-22 Thread Xinh Huynh
I can see how the linked documentation could be confusing:
"Aggregate function: returns the number of items in a group."

What it doesn't mention is that it returns the number of rows for which the
given column is non-null.
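That null-skipping behavior can be sketched without Spark at all; in the toy model below, `Option[Int]` stands in for a nullable column `b` (this is an illustration of the semantics, not Spark's API):

```scala
// Toy model of a nullable column: None plays the role of SQL NULL.
val b: Seq[Option[Int]] = Seq(Some(1), None, Some(3), None)

// count(*) -- counts every row, nulls included
val countStar = b.size

// count($"b") -- counts only the rows where b is non-null
val countB = b.count(_.isDefined)

println(countStar) // 4
println(countB)    // 2
```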

Xinh



Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Nice reactions. My comments:

@Ted.Yu: I see now that count(*) works for what I want
@Takeshi: I understand this is the syntax, but it was not clear to me what
this $"b" column would be used for...

My line of thinking was this:

I started with
1) someDF.groupBy("colA").count()

and then I realized I need an average of colB per group so I tried
2) someDF.groupBy("colA").agg( avg("colB"), count() )

but it failed because count needs an argument. I understand the situation
now. Thank you for the clarification! However, with future generations in
mind :) I still want to poke around:

- Usages of count in 1) and 2) still look a bit inconsistent to me. If this
is the way 2) works, why is there no column argument in 1)?
- I would expect a glimpse of all of this in the Scaladoc for these
methods. The difference between their doc strings is hard to catch:
- - usage 1), in org.apache.spark.sql.GroupedData: Count the number of *rows*
for each group...
- - usage 2), in org.apache.spark.sql.functions: ...returns the number of
*items* in a group...
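The rows-vs-items distinction above can be sketched with plain Scala collections (a toy model, not Spark; `Option` stands in for a nullable colB):

```scala
// Toy rows of (colA, colB), with None standing in for a NULL colB.
val rows = Seq(("x", Some(1)), ("x", None), ("y", Some(3)))

// Usage 1) groupBy("colA").count(): the number of *rows* per group.
val rowsPerGroup = rows.groupBy(_._1).map { case (k, vs) => k -> vs.size }
// x -> 2, y -> 1

// Usage 2) agg(count($"colB")): the number of non-null *items* per group.
val itemsPerGroup =
  rows.groupBy(_._1).map { case (k, vs) => k -> vs.count(_._2.isDefined) }
// x -> 1, y -> 1  (the null in group x is skipped)
```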

Thanks



Re: Confusing argument of sql.functions.count

2016-06-22 Thread Takeshi Yamamuro
Hi,

An argument for `functions.count` is needed for per-column counting:
df.groupBy($"a").agg(count($"b"))

// maropu



-- 
---
Takeshi Yamamuro


Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
See the first example in:

http://www.w3schools.com/sql/sql_func_count.asp



Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Hey Ted,

thanks for reacting.

I am referring to both of them. They both take a column as a parameter,
regardless of its type. My intuition is that count should take no
parameter. Or am I missing something?

Jakub



Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
Are you referring to the following method in
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

  def count(e: Column): Column = withAggregateFunction {

Did you notice this method?

  def count(columnName: String): TypedColumn[Any, Long] =
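The String overload is essentially sugar over the Column one. A minimal sketch of that overload pattern (a toy `Column` type for illustration, not Spark's actual implementation):

```scala
// Toy stand-in for Spark's Column, just to show the overload shape.
final case class Column(expr: String)

// Column-based overload: builds the aggregate expression.
def count(e: Column): Column = Column(s"count(${e.expr})")

// String-based overload: wraps the name in a Column and delegates.
def count(columnName: String): Column = count(Column(columnName))

println(count("b").expr) // count(b)
```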

On Wed, Jun 22, 2016 at 9:06 AM, Jakub Dubovsky <
spark.dubovsky.ja...@gmail.com> wrote:

> Hey sparkers,
>
> an aggregate function *count* in the *org.apache.spark.sql.functions*
> package takes a *column* as an argument. Is this needed for something? I
> find it confusing that I need to supply a column there. It feels like it
> might be a distinct count or something. This can be seen in the latest
> documentation
> 
> .
>
> I am considering filing this in the Spark bug tracker. Any opinions on this?
>
> Thanks
>
> Jakub
>
>