[
https://issues.apache.org/jira/browse/SPARK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Saurabh Chawla updated SPARK-36452:
-----------------------------------
Description:
Add the support in Spark for having group by map datatype column for the
scenario that works in Hive.
In hive the below scenario works
{code:java}
describe extended complex2;
OK
id string
c1 map<int, string>
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc,
createTime:1627994412, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null),
FieldSchema(name:c1, type:map<int,string>, comment:null)],
location:/user/hive/warehouse/complex2,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1})
select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)
select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)
failed when map type is present in aggregated expression
select id, max(c1), count(*) from complex2 group by id, c1;
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or
complex type containing map<>.
{code}
But in spark this scenario where the group by map column failed for this
scenario where the map column is used in the select without any aggregation
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression
spark_catalog.default.complex2.`c1` cannot be used as a grouping expression
because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
+- HiveTableRelation [`default`.`complex2`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2],
Partition Cols: []]
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}
There is need to add the this scenario where grouping expression can have map
type if aggregated expression does not have the that map type reference. This
helps in migrating the user from hive to Spark.
After the code change
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+
| id| c1|count(1)|
+---+-----------------+--------+
| 1| {1 -> u}| 1|
| 2|{1 -> u, 2 -> uo}| 1|
| 1|{1 -> u, 2 -> uo}| 1|
+---+-----------------+--------+
{code}
was:
Add the support in Spark for having group by map datatype column for the
scenario that works in Hive.
In hive the below scenario works
{code:java}
describe extended complex2;
OK
id string
c1 map<int, string>
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc,
createTime:1627994412, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null),
FieldSchema(name:c1, type:map<int,string>, comment:null)],
location:/user/hive/warehouse/complex2,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1})
select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)
select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)
failed when map type is present in aggregated expression
select id, max(c1), count(*) from complex2 group by id, c1;
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or
complex type containing map<>.
{code}
But in spark this scenario where the group by map column failed for this
scenario where the map column is used in the select without any aggregation
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression
spark_catalog.default.complex2.`c1` cannot be used as a grouping expression
because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
+- HiveTableRelation [`default`.`complex2`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2],
Partition Cols: []]
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}
There is need to add the this scenario where grouping expression can have map
type if aggregated expression does not have the that map type reference. This
helps in migrating the user from hive to Spark.
After the code change
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+
| id| c1|count(1)|
+---+-----------------+--------+
| 1| {1 -> u}| 1|
| 2|{1 -> u, 2 -> uo}| 1|
| 1|{1 -> u, 2 -> uo}| 1|
+---+-----------------+--------+
{code}
> Add the support in Spark for having group by map datatype column for the
> scenario that works in Hive
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-36452
> URL: https://issues.apache.org/jira/browse/SPARK-36452
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.3, 3.1.2, 3.2.0
> Reporter: Saurabh Chawla
> Priority: Major
>
> Add the support in Spark for having group by map datatype column for the
> scenario that works in Hive.
> In hive the below scenario works
> {code:java}
> describe extended complex2;
> OK
> id string
> c1 map<int, string>
> Detailed Table Information Table(tableName:complex2, dbName:default,
> owner:abc, createTime:1627994412, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null),
> FieldSchema(name:c1, type:map<int,string>, comment:null)],
> location:/user/hive/warehouse/complex2,
> inputFormat:org.apache.hadoop.mapred.TextInputFormat,
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
> parameters:{serialization.format=1})
> select * from complex2;
> OK
> 1 {1:"u"}
> 2 {1:"u",2:"uo"}
> 1 {1:"u",2:"uo"}
> Time taken: 0.363 seconds, Fetched: 3 row(s)
> select id, c1, count(*) from complex2 group by id, c1;
> OK
> 1 {1:"u"} 1
> 1 {1:"u",2:"uo"} 1
> 2 {1:"u",2:"uo"} 1
> Time taken: 1.621 seconds, Fetched: 3 row(s)
> failed when map type is present in aggregated expression
> select id, max(c1), count(*) from complex2 group by id, c1;
> FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or
> complex type containing map<>.
> {code}
> But in spark this scenario where the group by map column failed for this
> scenario where the map column is used in the select without any aggregation
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> org.apache.spark.sql.AnalysisException: expression
> spark_catalog.default.complex2.`c1` cannot be used as a grouping expression
> because its data type map<int,string> is not an orderable data type.;
> Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
> +- SubqueryAlias spark_catalog.default.complex2
> +- HiveTableRelation [`default`.`complex2`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2],
> Partition Cols: []]
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
> {code}
> There is need to add the this scenario where grouping expression can have map
> type if aggregated expression does not have the that map type reference. This
> helps in migrating the user from hive to Spark.
> After the code change
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> +---+-----------------+--------+
>
> | id| c1|count(1)|
> +---+-----------------+--------+
> | 1| {1 -> u}| 1|
> | 2|{1 -> u, 2 -> uo}| 1|
> | 1|{1 -> u, 2 -> uo}| 1|
> +---+-----------------+--------+
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]