Costas Piliotis created SPARK-31583:
---------------------------------------

             Summary: grouping_id calculation should be improved
                 Key: SPARK-31583
                 URL: https://issues.apache.org/jira/browse/SPARK-31583
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.5
            Reporter: Costas Piliotis


Unrelated to SPARK-21858 (which identifies that grouping_id is determined by 
exclusion from a grouping set rather than inclusion): when performing complex 
grouping sets whose columns are not in the order of the base select statement, 
the bit in the grouping_id appears to be assigned when the column is first seen 
in a grouping set, rather than by the column's position in the select clause.   
I will of course use the exclusion strategy identified in SPARK-21858 as the 
baseline for this.  

 
{code:scala}
import spark.implicits._

// Register a simple four-column view to demonstrate the behavior.
Seq(
 ("a","b","c","d"),
 ("a","b","c","d"),
 ("a","b","c","d"),
 ("a","b","c","d")
).toDF("a","b","c","d").createOrReplaceTempView("abc")
{code}
Based on the select-clause order, I expected these bit values in the grouping_id:
 d=1
 c=2
 b=4
 a=8
{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}
This returns:
{noformat}
+----+----+----+----+--------+---+-------+
|a   |b   |c   |d   |count(1)|gid|gid_bin|
+----+----+----+----+--------+---+-------+
|a   |null|c   |null|4       |6  |110    |
|null|null|null|null|4       |15 |1111   |
|a   |null|null|d   |4       |5  |101    |
|a   |b   |null|d   |4       |1  |1      |
+----+----+----+----+--------+---+-------+
{noformat}
 

 In other words, I expected the excluded columns to map to bits by their 
select-clause position, but instead they map to bits in the order they were 
first seen in the specified grouping sets.

 (a,b,d) included = excludes c=2; expected gid=2, received gid=1
 (a,d) included = excludes b=4, c=2; expected gid=6, received gid=5
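The expected assignment can be sketched in plain Scala (no Spark needed). This is an illustrative helper, not Spark code: the weights a=8, b=4, c=2, d=1 follow the select-clause order, and a bit is set when a column is *excluded* from the grouping set, per SPARK-21858:

{code:scala}
// Bit weights by select-clause position: a is the most significant bit.
val weights = Map("a" -> 8, "b" -> 4, "c" -> 2, "d" -> 1)

// A grouping set's gid is the sum of the weights of the EXCLUDED columns.
def expectedGid(included: Set[String]): Int =
  weights.collect { case (col, w) if !included(col) => w }.sum

expectedGid(Set("a", "b", "d"))  // 2  (only c excluded)
expectedGid(Set("a", "d"))       // 6  (b and c excluded)
expectedGid(Set.empty)           // 15 (everything excluded)
{code}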

The column order that actually reproduces the received values is grouping_id(a,b,d,c):

{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid,
        bin(grouping_id(a,b,d,c)) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}
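The observed behavior can be sketched the same way. Here the bit weights follow the order columns first appear across the grouping sets (a=8, b=4, d=2, c=1) — this ordering is an assumption inferred from the output above, not documented Spark behavior:

{code:scala}
// ASSUMED weights: columns in first-appearance order across the grouping
// sets (a, b, d, c), with a bit set when a column is excluded.
val observedWeights = Map("a" -> 8, "b" -> 4, "d" -> 2, "c" -> 1)

def observedGid(included: Set[String]): Int =
  observedWeights.collect { case (col, w) if !included(col) => w }.sum

observedGid(Set("a", "b", "d"))  // 1  (c excluded) — matches the output above
observedGid(Set("a", "d"))       // 5  (b and c excluded)
observedGid(Set("a", "c"))       // 6  (b and d excluded)
{code}

These values line up with the gid column in the first result set, supporting the claim that bit positions are allocated as grouping sets are parsed rather than by select-clause position.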


 The columns forming the grouping_id seem to be assigned bits as the grouping 
sets are identified, rather than by ordinal position in the parent query.

I'd like to at least point out that grouping_id is documented in many other 
RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
bits so that 1=inclusion and 0=exclusion in the grouping set.

However, many RDBMSs that do offer a grouping_id implement it by the ordinal 
position of the fields in the select clause, rather than allocating bits as 
columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
