I went ahead and downloaded and compiled Spark 2.0 to try your code
snippet, Takeshi. Unfortunately, it doesn't compile:

scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

scala> ds.groupBy($"_1").count.select($"_1", $"count").show
<console>:28: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
Long),?]
              ds.groupBy($"_1").count.select($"_1", $"count").show
                                                                 ^
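
For the record, here are two variants that I would expect to compile
against that signature (sketches only, not verified beyond this
snapshot): cast the column with .as[...] so select gets a TypedColumn,
or drop to the untyped DataFrame API, whose select still takes plain
Columns.

scala> // typed route: satisfy select's TypedColumn requirement
scala> ds.groupBy($"_1").count.select($"count".as[Long]).show

scala> // untyped route: go through toDF first
scala> ds.toDF.groupBy($"_1").count.select($"_1", $"count").show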

I also tried Xinh's suggestion, using the code snippets below (adapted
in part from the Spark docs):
scala> case class Person(name: String, age: Long)
scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2),
Person("Pedro", 24), Person("Bob", 42)).toDS()
scala> ds.groupBy(_.name).count.select($"name".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input
columns: [];
scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
columns: [];
scala> ds.groupBy($"name").count.select($"_1".as[String]).show
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
columns: [];

It looks like the result reports no input columns for some reason. The
code below works fine for the simple aggregate, though:
scala> ds.groupBy(_.name).count.show
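
One way to debug the name mismatch is to ask the aggregate what columns
it actually exposes before selecting (a sketch; trust printSchema's
output over any names I guess here):

scala> val counts = ds.groupBy(_.name).count
scala> counts.printSchema() // reveals the aggregate's real output columns
scala> counts.toDF().show() // untyped view, if select-by-name keeps failing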

It would be great to see an idiomatic example of mixing aggregates like
these with spark.sql.functions; the (untested) sketch below is roughly
what I have in mind.
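
(In the sketch, the column names cnt and avgAge are placeholders of my
own, and I am assuming the usual org.apache.spark.sql.functions._ import
and that going through toDF behaves as it does in 1.6.)

import org.apache.spark.sql.functions._

// drop to the untyped API, aggregate with sql.functions, select by name
ds.toDF()
  .groupBy($"name")
  .agg(count("*").as("cnt"), avg($"age").as("avgAge"))
  .select($"name", $"cnt", $"avgAge")
  .show()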

Pedro

On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Thanks Xinh and Takeshi,
>
> I am trying to avoid map, since my impression is that it runs a Scala
> closure and so isn't optimized as well as column-wise operations are.
>
> Looks like the $ notation is the way to go; thanks for the help. Is there
> an explanation of how it works? I imagine it is a method/function named $
> somewhere in Scala?
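> (Digging a little, it appears to be a StringContext interpolator pulled
> in by the implicits import; simplifying the actual source, roughly:
>
>   implicit class StringToColumn(val sc: StringContext) {
>     def $(args: Any*): ColumnName = new ColumnName(sc.s(args: _*))
>   }
>
> so $"count" is just new ColumnName("count"), equivalent to col("count").)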
>
> Lastly, are there preliminary Spark 2.0 docs? If there isn't a good
> description/guide for this syntax, I would be willing to contribute some
> documentation.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Hi,
>>
>> In 2.0, you can say:
>> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>> ds.groupBy($"_1").count.select($"_1", $"count").show
>>
>>
>> // maropu
>>
>>
>> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
>>
>>> Hi Pedro,
>>>
>>> In 1.6.1, you can do:
>>> >> ds.groupBy(_.uid).count().map(_._1)
>>> or
>>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>>
>>> It doesn't have the exact same syntax as for DataFrame.
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>>
>>> It might be different in 2.0.
>>>
>>> Xinh
>>>
>>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am working with Datasets in 1.6.1, and eventually 2.0 when it's
>>>> released.
>>>>
>>>> I am running the aggregate code below on a dataset whose rows have a
>>>> field uid:
>>>>
>>>> ds.groupBy(_.uid).count()
>>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
>>>> _2: bigint]
>>>>
>>>> This works as expected; however, attempts to run a select afterwards
>>>> fail:
>>>> ds.groupBy(_.uid).count().select(_._1)
>>>> // error: missing parameter type for expanded function ((x$2) =>
>>>> x$2._1)
>>>> ds.groupBy(_.uid).count().select(_._1)
>>>>
>>>> I have tried several variants, but nothing seems to work. Below is the
>>>> equivalent DataFrame code, which works as expected:
>>>> df.groupBy("uid").count().select("uid")
>>>>
>>>> Thanks!
>>>> --
>>>> Pedro Rodriguez
>>>> PhD Student in Distributed Machine Learning | CU Boulder
>>>> UC Berkeley AMPLab Alumni
>>>>
>>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>>> Github: github.com/EntilZha | LinkedIn:
>>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>>
>>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
