Ohh great! Thanks for the clarification.

On Wed, Oct 28, 2015 at 4:21 PM, Reynold Xin <r...@databricks.com> wrote:
> No, those are just functions for the DataFrame programming API.
>
> On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>
>> @Reynold I seem to be missing something. Aren't the functions listed here
>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
>> to be treated as SQL operators as well? I do see that these are mentioned as
>> functions available for DataFrame
>> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>,
>> but it would be great if you could clarify this.
>>
>> On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I don't think these are bugs. The SQL standard for average is "avg", not
>>> "mean". Similarly, a distinct count is supposed to be written as
>>> "count(distinct col)", not "countDistinct(col)".
>>>
>>> We can, however, make "mean" an alias for "avg" to improve compatibility
>>> between DataFrame and SQL.
>>>
>>> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>
>>>> Also, are the other aggregate functions to be treated as bugs or not?
>>>> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>
>>>>> Wouldn't it be:
>>>>>
>>>>> + expression[Max]("avg"),
>>>>>
>>>>> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Since there is already Average, the simplest change is the following:
>>>>>>
>>>>>> $ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
>>>>>> index 3dce6c1..920f95b 100644
>>>>>> --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>>>>>      expression[Last]("last"),
>>>>>>      expression[Last]("last_value"),
>>>>>>      expression[Max]("max"),
>>>>>> +    expression[Average]("mean"),
>>>>>>      expression[Min]("min"),
>>>>>>      expression[Stddev]("stddev"),
>>>>>>      expression[StddevPop]("stddev_pop"),
>>>>>>
>>>>>> FYI
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>
>>>>>>> I tried adding the aggregate functions in the registry and they work,
>>>>>>> other than mean, for which Ted has forwarded some code changes. I will
>>>>>>> try out those changes and update the status here.
>>>>>>>
>>>>>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yup, avg works well. So we have alternate functions to use in place
>>>>>>>> of the functions pointed out earlier. But my point is: are those
>>>>>>>> original aggregate functions not supposed to be used, or am I using
>>>>>>>> them in the wrong way, or is it a bug, as I asked in my first mail?
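[Editor's note: Ted's diff above registers the existing Average expression under a second name. As an illustration only — this toy is not Spark's actual FunctionRegistry, whose entries are expression builders, not plain functions — here is a minimal, self-contained sketch of the aliasing idea:]

```scala
// Toy sketch only, not Spark's FunctionRegistry. It illustrates the
// proposal in the thread: "mean" becomes a second registry name for the
// existing average implementation, so both names resolve identically.
object ToyFunctionRegistry {
  private val average: Seq[Double] => Double =
    xs => xs.sum / xs.size

  // Registration table: several SQL names may map to one implementation.
  private val registry: Map[String, Seq[Double] => Double] = Map(
    "avg"  -> average,
    "mean" -> average // the alias the one-line diff adds
  )

  def lookup(name: String): Seq[Double] => Double =
    registry.getOrElse(name.toLowerCase,
      throw new IllegalArgumentException(s"undefined function $name"))
}

// Both names now resolve to the same function, e.g.
// ToyFunctionRegistry.lookup("mean")(Seq(1.0, 2.0, 3.0)) returns 2.0,
// while an unregistered name fails with "undefined function".
```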
>>>>>>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Have you tried using avg in place of mean?
>>>>>>>>>
>>>>>>>>> (1 to 5).foreach { i =>
>>>>>>>>>   val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b")
>>>>>>>>>   df.save(s"/tmp/partitioned/i=$i")
>>>>>>>>> }
>>>>>>>>> sqlContext.sql("""
>>>>>>>>> CREATE TEMPORARY TABLE partitionedParquet
>>>>>>>>> USING org.apache.spark.sql.parquet
>>>>>>>>> OPTIONS (
>>>>>>>>>   path '/tmp/partitioned'
>>>>>>>>> )""")
>>>>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>>>>>>>>> sumDistinct running, but mean and approxCountDistinct do not work.
>>>>>>>>>> (I guess I am using the wrong syntax for approxCountDistinct.) For
>>>>>>>>>> mean, I think the registry entry is missing. Can someone clarify
>>>>>>>>>> that as well?
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Will try in a while when I get back. I assume this applies to all
>>>>>>>>>>> functions other than mean. Also, countDistinct is defined along
>>>>>>>>>>> with all the other SQL functions, so I don't get the "distinct is
>>>>>>>>>>> not part of the function name" part.
>>>>>>>>>>> On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Try
>>>>>>>>>>>>
>>>>>>>>>>>> count(distinct columnname)
>>>>>>>>>>>>
>>>>>>>>>>>> In SQL, distinct is not part of the function name.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, October 27, 2015, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oops, seems I made a mistake.
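[Editor's note: Reynold's point above — that SQL treats DISTINCT as a modifier inside the call, while the DataFrame API bakes it into a function name — can be sketched with a small, Spark-free toy. The two functions below are hypothetical stand-ins, not Spark APIs:]

```scala
// Toy illustration (no Spark involved) of the two spellings of a
// distinct count discussed above.
val ages = Seq(25.0, 30.0, 30.0, 41.0)

// DataFrame-API style: "distinct" is part of the function's name.
def countDistinct(xs: Seq[Double]): Long = xs.distinct.size.toLong

// SQL style: count(...) is the function; DISTINCT is a modifier the
// parser applies inside the call, as in count(distinct age).
def count(xs: Seq[Double], distinct: Boolean = false): Long =
  (if (distinct) xs.distinct else xs).size.toLong

// Same answer either way; only the surface syntax differs:
// countDistinct(ages) == count(ages, distinct = true)  // both 3
```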
>>>>>>>>>>>>> The error message is: Exception in thread "main"
>>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function countDistinct
>>>>>>>>>>>>> On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi! I was trying out some aggregate functions in SparkSql, and
>>>>>>>>>>>>>> I noticed that certain aggregate operators are not working.
>>>>>>>>>>>>>> This includes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> approxCountDistinct
>>>>>>>>>>>>>> countDistinct
>>>>>>>>>>>>>> mean
>>>>>>>>>>>>>> sumDistinct
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example, using countDistinct results in an error saying
>>>>>>>>>>>>>> *Exception in thread "main"
>>>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function cosh;*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had a similar issue with the cosh operator
>>>>>>>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
>>>>>>>>>>>>>> some time back as well, and it turned out that it was not
>>>>>>>>>>>>>> registered in the registry:
>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *I think it is the same issue again and would be glad to send
>>>>>>>>>>>>>> over a PR if someone can confirm that this is an actual bug and
>>>>>>>>>>>>>> not some mistake on my part.*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>>>>>>>>>>>>>> Spark Version: 10.4
>>>>>>>>>>>>>> SparkSql Version: 1.5.1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am using the standard example of a (name, age) schema (though
>>>>>>>>>>>>>> I am setting age as Double and not Int, as I am trying out
>>>>>>>>>>>>>> maths functions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The entire error stack can be found here <http://pastebin.com/G6YzQXnn>.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!