Look at the implementation of frequent items.  It is different from a
true count.
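
For comparison, here is a minimal sketch of the two approaches, assuming a
local Spark 1.4+ setup. The object name CountByValueSketch and the sample
data are made up for illustration; countColsByValue follows the signature
proposed further down the thread. As far as I understand it, freqItems
returns candidate frequent items without counts and may include false
positives, whereas a per-column groupBy gives exact counts.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical object name, used only for this illustration.
object CountByValueSketch {

  // Exact counts: one (value, count) DataFrame per column, keyed by column name.
  // Each column is grouped on its own, so tuples of values are never the key.
  def countColsByValue(df: DataFrame): Map[String, DataFrame] =
    df.columns.map(c => c -> df.groupBy(c).count()).toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-by-value-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up sample data with two categorical columns.
    val df = sc.parallelize(Seq(("A", "x"), ("B", "x"), ("B", "y"), ("A", "x")))
      .toDF("n", "m")

    // Exact: this launches one aggregation job per column.
    countColsByValue(df).foreach { case (name, counts) =>
      println(s"exact counts for column $name")
      counts.show()
    }

    // Approximate: a single pass over the data. The result holds one array
    // column per input column (n_freqItems, m_freqItems) and no counts.
    df.stat.freqItems(Seq("n", "m"), 0.4).show()

    sc.stop()
  }
}
```

The exact version costs one job per column, so freqItems is cheaper on wide
tables, but it cannot tell you how often each value actually occurs.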
On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:

> Is this just frequent items?
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>
>
>
> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com>
> wrote:
>
>> 100% I would love to do it.  Who is a good person to review the design
>> with?  All I need is a quick chat about the design and approach, and I'll
>> create the JIRA and push a patch.
>>
>> Ted Malaska
>>
>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi Ted,
>>> The TopNList would be great to see directly in the DataFrame API, and my
>>> wish would be to be able to apply it to multiple columns at the same time
>>> and get all these statistics.
>>> The .describe() function is close to what we want to achieve; maybe we
>>> could try to enrich its output.
>>> Anyway, even as a spark-package, if you could package your code for
>>> DataFrames, that would be great.
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>
>>> :
>>>
>>>> Ha ok !
>>>>
>>>> Then the generic part would have this signature:
>>>>
>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>>
>>>>
>>>> +1 for more work (blog / api) for data quality checks.
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>>
>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>> that :
>>>>
>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>
>>>>
>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>>>
>>>>> I'm guessing you want something like what I put in this blog post.
>>>>>
>>>>>
>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>
>>>>> This is a very common use case.  If there is a +1 I would love to add
>>>>> it to dataframes.
>>>>>
>>>>> Let me know
>>>>> Ted Malaska
>>>>>
>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Yop,
>>>>>> actually the generic part does not do what I want: countByValue on one
>>>>>> column gives you the count for each value seen in that column.
>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>> same kind of output for each column, without treating each n-tuple of
>>>>>> column values as the key (which is what groupBy does by default).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Olivier
>>>>>>
>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>>> jonathan.wina...@gmail.com>:
>>>>>>
>>>>>>> Ahoy !
>>>>>>>
>>>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>>>
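>>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>>
>>>>>>> // some DF
>>>>>>> val df: DataFrame =
>>>>>>>   sqlContext.createDataFrame(
>>>>>>>     sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>>>     StructType(List(StructField("n", StringType))))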
>>>>>>>
>>>>>>>
>>>>>>> df.groupBy("n").count().show()
>>>>>>>
>>>>>>>
>>>>>>> // generic
>>>>>>> def countByValueDf(df:DataFrame) = {
>>>>>>>
>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>
>>>>>>>   df.groupBy(h, r:_*).count()
>>>>>>> }
>>>>>>>
>>>>>>> countByValueDf(df).show()
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jon
>>>>>>>
>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>>>> DataFrame?
>>>>>>>> Even
>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>> is using the RDD part right now, but for ML purposes, being able to get
>>>>>>>> the most frequent categorical value on multiple columns would be very
>>>>>>>> useful.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>> +33 6 24 09 17 94
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Olivier Girardot* | Associé
>>>>>> o.girar...@lateral-thoughts.com
>>>>>> +33 6 24 09 17 94
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>>
>>
>>
>
