Chris Nasrallah created SPARK-18358:
---------------------------------------

             Summary: Multiple Aggregation Using 'countDistinct' and 'first' 
result in error 
                 Key: SPARK-18358
                 URL: https://issues.apache.org/jira/browse/SPARK-18358
             Project: Spark
          Issue Type: Bug
         Environment: Mac OS X 10.9.5
Apache Spark 2.0.1
Hadoop 1.4
            Reporter: Chris Nasrallah


Using PySpark, performing multiple aggregations on the same groupBy object
with both 'first' and 'countDistinct' raises a Py4JJavaError whenever two
'countDistinct' aggregations are combined with at least one 'first'.

{code:borderStyle=solid}
from pyspark.sql import SparkSession
import pyspark.sql.functions as sfn

# Named 'spark' so that the createDataFrame call below resolves.
spark = SparkSession.builder.master('local').getOrCreate()

df = spark.createDataFrame([
        (1, 'a', 'z'),
        (1, 'b', 'x'),
        (1, 'a', 'y'),
        (1, 'a', 'x'),
        (2, 'b', 'z'),
        (2, 'b', 'z')
    ], ['id', 'var1', 'var2'])

# Two 'first' aggregations plus one 'countDistinct' work:
df.groupby('id').agg(
    sfn.first('var1'),
    sfn.first('var2'),
    sfn.countDistinct('var1'),
).show()

# One 'max' plus both 'countDistinct' aggregations also work:
df.groupby('id').agg(
    sfn.max('var2'),
    sfn.countDistinct('var1'),
    sfn.countDistinct('var2'),
).show()

# But both 'countDistinct' aggregations with at least one 'first' crash
# with a Py4JJavaError:
df.groupby('id').agg(
    sfn.first('var1'),
    sfn.first('var2'),
    sfn.countDistinct('var1'),
    sfn.countDistinct('var2'),
).show()
{code}
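
For reference, here is a minimal plain-Python sketch (no Spark involved) of what the failing query would be expected to return for the sample data above. The helper name `expected_aggregation` is hypothetical, and the sketch assumes that 'first' means the first row in input order; Spark itself does not guarantee that ordering once rows have been shuffled, so the 'first' columns may legitimately differ on a real cluster.

```python
from collections import OrderedDict

# Same sample rows as in the report: (id, var1, var2).
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregation(rows):
    """Return (id, first(var1), first(var2), distinct var1, distinct var2) per id.

    'first' is taken as the first row seen in input order (an assumption).
    """
    groups = OrderedDict()
    for id_, var1, var2 in rows:
        # setdefault only stores the 'first' values for the first row of a group.
        group = groups.setdefault(id_, {
            'first_var1': var1,
            'first_var2': var2,
            'var1_values': set(),
            'var2_values': set(),
        })
        group['var1_values'].add(var1)
        group['var2_values'].add(var2)
    return [
        (id_, g['first_var1'], g['first_var2'],
         len(g['var1_values']), len(g['var2_values']))
        for id_, g in groups.items()
    ]


for row in expected_aggregation(rows):
    print(row)
```

Under that ordering assumption, id 1 has two distinct 'var1' values and three distinct 'var2' values, while id 2 has one of each.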



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
