Ahoy!

Maybe you can get countByValue by using sql.GroupedData:

// some DF
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row(_)),
  StructType(List(StructField("n", StringType))))


df.groupBy("n").count().show()


// generic: group by every column, then count identical rows
def countByValueDf(df: DataFrame) = {
  val h :: r = df.columns.toList
  df.groupBy(h, r: _*).count()
}

countByValueDf(df).show()
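
For the multi-column ML case mentioned below (most frequent categorical value per column), something along these lines should work. Note that mostFrequentPerColumn is just a sketch, not an existing Spark helper:

// sketch only: one groupBy/count per column, keep the top value
import org.apache.spark.sql.functions.col

def mostFrequentPerColumn(df: DataFrame): Map[String, Any] =
  df.columns.map { c =>
    val top = df.groupBy(col(c)).count().orderBy(col("count").desc).first()
    c -> top.get(0)
  }.toMap

mostFrequentPerColumn(df) // here a tie, so Map("n" -> "A") or Map("n" -> "B")

It runs one aggregation per column, so it is not as cheap as a single RDD countByValue, but it stays entirely in the DataFrame API.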


Cheers,
Jon

On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com>
wrote:

> Hi,
> Is there any plan to add the countByValue function to Spark SQL Dataframe?
> Even
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
> is using the RDD part right now, but for ML purposes, being able to get the
> most frequent categorical value on multiple columns would be very useful.
>
>
> Regards,
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>
