I went ahead and downloaded/compiled Spark 2.0 to try your code snippet, Takeshi. Unfortunately, it doesn't compile:
scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show
<console>:28: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row, Long),?]
       ds.groupBy($"_1").count.select($"_1", $"count").show
                                      ^

I also gave Xinh's suggestion a try, using the code snippet below (partially from the Spark docs):

scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2), Person("Pedro", 24), Person("Bob", 42)).toDS()

scala> ds.groupBy(_.name).count.select($"name".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];

scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [];

scala> ds.groupBy($"name").count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns: [];

It looks like the resolved column list is empty for some reason. The code below works fine for the simple aggregate:

scala> ds.groupBy(_.name).count.show

It would be great to see an idiomatic example of using aggregates like these mixed with spark.sql.functions.

Pedro

On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> Thanks Xinh and Takeshi,
>
> I am trying to avoid map, since my impression is that it uses a Scala
> closure and so is not optimized as well as column-wise operations are.
>
> Looks like the $ notation is the way to go; thanks for the help. Is there
> an explanation of how this works? I imagine it is a method/function
> named $ in Scala?
>
> Lastly, are there preliminary Spark 2.0 docs? If there isn't a good
> description/guide for this syntax, I would be willing to contribute
> some documentation.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>
>> Hi,
>>
>> In 2.0, you can say:
>>
>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>
>> // maropu
>>
>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>
>>> Hi Pedro,
>>>
>>> In 1.6.1, you can do:
>>>
>>> ds.groupBy(_.uid).count().map(_._1)
>>> or
>>> ds.groupBy(_.uid).count().select($"value".as[String])
>>>
>>> It doesn't have the exact same syntax as for DataFrame:
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>
>>> It might be different in 2.0.
>>>
>>> Xinh
>>>
>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am working on using Datasets in 1.6.1, and eventually in 2.0 when it's
>>>> released.
>>>>
>>>> I am running the aggregate code below on a dataset whose rows have a
>>>> field uid:
>>>>
>>>> ds.groupBy(_.uid).count()
>>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2: bigint]
>>>>
>>>> This works as expected; however, attempts to run select statements
>>>> afterward fail:
>>>>
>>>> ds.groupBy(_.uid).count().select(_._1)
>>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>>>
>>>> I have tried several variants, but nothing seems to work. Below is the
>>>> equivalent DataFrame code, which works as expected:
>>>>
>>>> df.groupBy("uid").count().select("uid")
>>>>
>>>> Thanks!
>>>> --
>>>> Pedro Rodriguez
>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>> UC Berkeley AMPLab Alumni
>>>>
>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>> Github: github.com/EntilZha | LinkedIn:
>>>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>> --
>> ---
>> Takeshi Yamamuro

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
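[Editor's note on the $ question raised in the thread: $"colName" is not special compiler syntax but an ordinary custom string interpolator, i.e. an implicit class on StringContext that defines a method named $. Spark ships this in its SQL implicits. Below is a minimal, Spark-free sketch of the mechanism; the Column case class here is a hypothetical stand-in for Spark's org.apache.spark.sql.ColumnName, not Spark's actual class.]

```scala
// Sketch of the $"..." mechanism, assuming a stand-in Column type.
// $"uid" desugars to StringContext("uid").$(), which builds Column("uid").
case class Column(name: String) // stand-in for org.apache.spark.sql.ColumnName

object ColumnSyntax {
  implicit class StringToColumn(val sc: StringContext) extends AnyVal {
    // Custom interpolator named `$`; sc.s(...) performs normal interpolation.
    def $(args: Any*): Column = Column(sc.s(args: _*))
  }
}

object Demo {
  import ColumnSyntax._
  def main(args: Array[String]): Unit = {
    println($"uid".name) // uid
    val i = 1
    println($"_$i".name) // _1 -- ordinary interpolation still applies
  }
}
```

[Because it is just an implicit on StringContext, it only becomes available after importing the implicits, e.g. import spark.implicits._ in a 2.0 session, which is why the REPL examples above can use it directly.]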