Look at the implementation for frequent items. It is different from a true count.

On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:
> Is this just frequent items?
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>
> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com> wrote:
>
>> 100% I would love to do it. Who is a good person to review the design
>> with? All I need is a quick chat about the design and approach and I'll
>> create the JIRA and push a patch.
>>
>> Ted Malaska
>>
>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi Ted,
>>> The TopNList would be great to see directly in the DataFrame API and my
>>> wish would be to be able to apply it on multiple columns at the same time
>>> and get all these statistics.
>>> The .describe() function is close to what we want to achieve, maybe we
>>> could try to enrich its output.
>>> Anyway, even as a spark-package, if you could package your code for
>>> DataFrames, that would be great.
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>
>>>> Ha ok !
>>>>
>>>> Then the generic part would have this signature:
>>>>
>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>>
>>>> +1 for more work (blog / api) for data quality checks.
>>>>
>>>> Cheers,
>>>> Jonathan
>>>>
>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>> that:
>>>>
>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>
>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>>>
>>>>> I'm guessing you want something like what I put in this blog post.
>>>>>
>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>
>>>>> This is a very common use case. If there is a +1 I would love to add
>>>>> it to DataFrames.
>>>>>
>>>>> Let me know
>>>>> Ted Malaska
>>>>>
>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Yop,
>>>>>> actually the generic part does not work: the countByValue on one
>>>>>> column gives you the count for each value seen in the column.
>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>> same kind of output for each column, not considering each n-tuple of
>>>>>> column values as the key (which is what the groupBy is doing by default).
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Olivier
>>>>>>
>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>>>>
>>>>>>> Ahoy !
>>>>>>>
>>>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>>>
>>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>>
>>>>>>> // some DF
>>>>>>> val df: DataFrame =
>>>>>>>   sqlContext.createDataFrame(
>>>>>>>     sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>>>     StructType(List(StructField("n", StringType))))
>>>>>>>
>>>>>>> df.groupBy("n").count().show()
>>>>>>>
>>>>>>> // generic: groups on all columns at once, so it counts distinct row tuples
>>>>>>> def countByValueDf(df: DataFrame) = {
>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>   df.groupBy(h, r: _*).count()
>>>>>>> }
>>>>>>>
>>>>>>> countByValueDf(df).show()
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jon
>>>>>>>
>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>>>> DataFrame?
>>>>>>>> Even
>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>> is using the RDD part right now, but for ML purposes, being able to
>>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>>> very useful.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>> +33 6 24 09 17 94
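
The distinction drawn in the thread, between counting each n-tuple of column values (what a groupBy over all columns does) and counting values separately per column, can be illustrated with a minimal Spark-free sketch. Everything here is an assumption made for illustration: rows are modelled as plain sequences of strings, and the names CountByValueSketch and countByTuple are hypothetical. countColsByValue borrows the name proposed in the thread but returns plain maps and is not a DataFrame API. With a real DataFrame, the per-column variant could presumably be built from one groupBy(c).count() per column.

```scala
// Spark-free model of the two behaviours discussed in the thread.
// A row is a plain sequence of strings; `columns` names each position.
object CountByValueSketch {
  type Row = Seq[String]

  // Models a groupBy over all columns: one count per distinct tuple of values.
  def countByTuple(rows: Seq[Row]): Map[Row, Long] =
    rows.groupBy(identity).map { case (tuple, group) => tuple -> group.size.toLong }

  // Models the requested per-column variant: for each column name,
  // one count per distinct value seen in that column.
  def countColsByValue(columns: Seq[String], rows: Seq[Row]): Map[String, Map[String, Long]] =
    columns.zipWithIndex.map { case (name, i) =>
      val counts = rows.groupBy(row => row(i)).map { case (value, group) => value -> group.size.toLong }
      name -> counts
    }.toMap
}
```

For example, with columns Seq("n", "m") and rows Seq(Seq("A", "x"), Seq("B", "x"), Seq("B", "y"), Seq("A", "x")), countByTuple has three keys (one per distinct row), while countColsByValue gives an independent set of counts for "n" and for "m".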