Costas Piliotis created SPARK-31583:
---------------------------------------
Summary: grouping_id calculation should be improved
Key: SPARK-31583
URL: https://issues.apache.org/jira/browse/SPARK-31583
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.5
Reporter: Costas Piliotis
Unrelated to SPARK-21858, which identifies that grouping_id is determined by
exclusion from a grouping set rather than inclusion: when using complex
grouping sets whose columns are not in the order of the base select statement,
the bit in the grouping_id appears to be flipped when the grouping set is
identified rather than when the columns are selected in the SQL. I will of
course use the exclusion strategy identified in SPARK-21858 as the baseline for
this.
{code:scala}
import spark.implicits._
val df = Seq(
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d")
).toDF("a", "b", "c", "d")
df.createOrReplaceTempView("abc")
{code}
Based on ordinal position in the select clause, I expected these columns to map
to the following bit values in the grouping_id:
d=1
c=2
b=4
a=8
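To make that expectation concrete, here is a small illustrative Scala sketch
(plain arithmetic, not Spark internals) that computes the grouping_id I would
expect, assigning exclusion bits by the columns' ordinal position in the select
clause:
{code:scala}
// Illustrative sketch only: expected grouping_id assuming exclusion bits are
// assigned by ordinal position in the select clause (a=8, b=4, c=2, d=1).
val selectOrder = Seq("a", "b", "c", "d")

def expectedGid(groupingSet: Set[String]): Int =
  selectOrder.zipWithIndex.map { case (col, i) =>
    val shift = selectOrder.length - 1 - i           // a -> 3, b -> 2, c -> 1, d -> 0
    if (groupingSet.contains(col)) 0 else 1 << shift // bit set when the column is excluded
  }.sum

expectedGid(Set("a", "b", "d"))  // 2  (only c excluded)
expectedGid(Set("a", "d"))       // 6  (b and c excluded)
expectedGid(Set.empty[String])   // 15 (all columns excluded)
{code}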
{code:scala}
spark.sql("""
select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
""").show(false)
{code}
This returns:
{noformat}
+----+----+----+----+--------+---+-------+
|a |b |c |d |count(1)|gid|gid_bin|
+----+----+----+----+--------+---+-------+
|a |null|c |null|4 |6 |110 |
|null|null|null|null|4 |15 |1111 |
|a |null|null|d |4 |5 |101 |
|a |b |null|d |4 |1 |1 |
+----+----+----+----+--------+---+-------+
{noformat}
In other words, I expected the excluded columns to contribute bits based on
their position in the select clause, but instead the bits reflect the order in
which the columns are first seen in the specified grouping sets.
a,b,d included = excludes c=2; expected gid=2, received gid=1
a,d included = excludes b=4, c=2; expected gid=6, received gid=5
The grouping_id actually produced corresponds to the column order (a,b,d,c):
{code:scala}
spark.sql("""
select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid,
bin(grouping_id(a,b,d,c)) as gid_bin
from abc
group by GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
""").show(false)
{code}
The columns forming the grouping_id seem to be ordered as the grouping sets are
identified, rather than by ordinal position in the parent query.
I'd like to at least point out that grouping_id is documented in many other
RDBMSs, and I believe the Spark project should adopt a policy of flipping the
bits so that 1=inclusion and 0=exclusion in the grouping set.
However, many RDBMSs that do offer a grouping_id implement it by the ordinal
position of the fields in the select clause, rather than allocating bits as the
columns are observed in the grouping sets.
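As a minimal sketch of the semantics I am suggesting (again plain Scala for
illustration, not Spark internals), with bit weights taken from the
select-clause order and 1 meaning the column is included in the grouping set:
{code:scala}
// Sketch of the suggested semantics only: 1 = column included in the grouping
// set, 0 = excluded, with bit weights taken from the select-clause order.
val selectOrder = Seq("a", "b", "c", "d")

def proposedGid(groupingSet: Set[String]): Int =
  selectOrder.zipWithIndex.map { case (col, i) =>
    val shift = selectOrder.length - 1 - i            // shifts 3,2,1,0 -> weights 8,4,2,1
    if (groupingSet.contains(col)) 1 << shift else 0  // bit set when the column is included
  }.sum

proposedGid(Set("a", "b", "d"))  // 13 (binary 1101)
proposedGid(Set("a", "d"))       // 9  (binary 1001)
proposedGid(Set.empty[String])   // 0
{code}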