[
https://issues.apache.org/jira/browse/SPARK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-36452:
------------------------------------
Assignee: (was: Apache Spark)
> Add the support in Spark for having group by map datatype column for the
> scenario that works in Hive
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-36452
> URL: https://issues.apache.org/jira/browse/SPARK-36452
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.3, 3.1.2, 3.2.0
> Reporter: Saurabh Chawla
> Priority: Major
>
> Add the support in Spark for having group by map datatype column for the
> scenario that works in Hive.
> In hive the below scenario works
> {code:java}
> describe extended complex2;
> OK
> id string
> c1 map<int, string>
> Detailed Table Information Table(tableName:complex2, dbName:default,
> owner:abc, createTime:1627994412, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null),
> FieldSchema(name:c1, type:map<int,string>, comment:null)],
> location:/user/hive/warehouse/complex2,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1})
> select * from complex2;
> OK
> 1 {1:"u"}
> 2 {1:"u",2:"uo"}
> 1 {1:"u",2:"uo"}
> Time taken: 0.363 seconds, Fetched: 3 row(s)
> select id, c1, count(*) from complex2 group by id, c1;
> OK
> 1 {1:"u"} 1
> 1 {1:"u",2:"uo"} 1
> 2 {1:"u",2:"uo"} 1
> Time taken: 1.621 seconds, Fetched: 3 row(s)
> failed when map type is present in aggregated expression
> select id, max(c1), count(*) from complex2 group by id, c1;
> FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or
> complex type containing map<>.
> {code}
> But in spark this scenario where the group by map column failed for this
> scenario where the map column is used in the select without any aggregation
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> org.apache.spark.sql.AnalysisException: expression
> spark_catalog.default.complex2.`c1` cannot be used as a grouping expression
> because its data type map<int,string> is not an orderable data type.;
> Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
> +- SubqueryAlias spark_catalog.default.complex2
> +- HiveTableRelation [`default`.`complex2`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2],
> Partition Cols: []]
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
> {code}
> There is need to add the this scenario where grouping expression can have map
> type if aggregated expression does not have the that map type reference. This
> helps in migrating the user from hive to Spark.
> After the code change
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> +---+-----------------+--------+
>
> | id| c1|count(1)|
> +---+-----------------+--------+
> | 1| {1 -> u}| 1|
> | 2|{1 -> u, 2 -> uo}| 1|
> | 1|{1 -> u, 2 -> uo}| 1|
> +---+-----------------+--------+
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]