Ahoy! Maybe you can get countByValue by using sql.GroupedData:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// some DF: a one-column DataFrame of strings
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))
df.groupBy("n").count().show()
// counts per distinct value of "n": A -> 2, B -> 2 (row order may vary)

// generic: group by every column of the DataFrame, then count
def countByValueDf(df: DataFrame) = {
  val (h :: r) = df.columns.toList
  df.groupBy(h, r: _*).count()
}
countByValueDf(df).show()

Cheers,
Jon

On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com>
wrote:

> Hi,
> Is there any plan to add the countByValue function to the Spark SQL
> DataFrame? Even
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
> is using the RDD part right now, but for ML purposes, being able to get
> the most frequent categorical value on multiple columns would be very
> useful.
>
> Regards,
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
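
For the multi-column use case Olivier describes (the most frequent categorical
value per column, as in StringIndexer), the same groupBy/count idea extends to
a per-column top-1. A minimal sketch, assuming the Spark 1.4-era DataFrame API;
mostFrequentPerColumn is a hypothetical helper name, not anything from Spark:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Hypothetical helper (an assumption, not from the thread): for each column,
// count value frequencies and keep the value with the highest count.
def mostFrequentPerColumn(df: DataFrame): Map[String, Any] =
  df.columns.map { c =>
    // top-1 by descending count; ties resolve arbitrarily
    val top = df.groupBy(c).count().orderBy(desc("count")).first()
    c -> top.get(0)
  }.toMap

// e.g. on the df above: Map("n" -> "A") or Map("n" -> "B") (both appear twice)

Note this triggers one job per column; for very wide DataFrames a single pass
over the underlying RDD (roughly what StringIndexer does today) may be cheaper.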