[
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094284#comment-17094284
]
Takeshi Yamamuro commented on SPARK-31583:
------------------------------------------
[~cpiliotis] Hi, thanks for your report! Just a question; you proposed the two
things below in this JIRA?
- reordering bit positions in grouping_id corresponding to a projection list
in select
- flipping the current output in grouping_id
> grouping_id calculation should be improved
> ------------------------------------------
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Costas Piliotis
> Priority: Minor
>
> Unrelated to SPARK-21858 which identifies that grouping_id is determined by
> exclusion from a grouping_set rather than inclusion, when performing complex
> grouping_sets that are not in the order of the base select statement,
> flipping the bit in the grouping_id seems to be happen when the grouping set
> is identified rather than when the columns are selected in the sql. I will
> of course use the exclusion strategy identified in SPARK-21858 as the
> baseline for this.
>
> {code:scala}
> import spark.implicits._
> val df= Seq(
> ("a","b","c","d"),
> ("a","b","c","d"),
> ("a","b","c","d"),
> ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
> d=1
> c=2
> b=4
> a=8
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
> from abc
> group by GROUPING SETS (
> (),
> (a,b,d),
> (a,c),
> (a,d)
> )
> """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a |b |c |d |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a |null|c |null|4 |6 |110 |
> |null|null|null|null|4 |15 |1111 |
> |a |null|null|d |4 |5 |101 |
> |a |b |null|d |4 |1 |1 |
> +----+----+----+----+--------+---+-------+
> {noformat}
>
> In other words, I would have expected the excluded values one way but I
> received them excluded in the order they were first seen in the specified
> grouping sets.
> a,b,d included = excldes c = 2; expected gid=2. received gid=1
> a,d included = excludes b=4, c=2 expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c)
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid,
> bin(grouping_id(a,b,d,c)) as gid_bin
> from abc
> group by GROUPING SETS (
> (),
> (a,b,d),
> (a,c),
> (a,d)
> )
> """).show(false)
> {code}
> columns forming groupingid seem to be creatred as the grouping sets are
> identified rather than ordinal position in parent query.
> I'd like to at least point out that grouping_id is documented in many other
> rdbms and I believe the spark project should use a policy of flipping the
> bits so 1=inclusion; 0=exclusion in the grouping set.
> However many rdms that do have the feature of a grouping_id do implement it
> by the ordinal position recognized as fields in the select clause, rather
> than allocating them as they are observed in the grouping sets.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]