[jira] [Updated] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

Saurabh Chawla (Jira) Sun, 08 Aug 2021 06:45:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Saurabh Chawla updated SPARK-36452:
-----------------------------------
    Description: 
Add the support in Spark for having group by map datatype column for the 
scenario that works in Hive.

In hive the below scenario works 
{code:java}
describe extended complex2;
OK
id                  string 
c1                  map<int, string>   
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, 
createTime:1627994412, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), 
FieldSchema(name:c1, type:map<int,string>, comment:null)], 
location:/user/hive/warehouse/complex2, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null,
 serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1})

select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)

Working Scenario in Hive -: 

select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)

Failed Scenario in Hive -: failed when map type is present in aggregated 
expression 

select id, max(c1), count(*) from complex2 group by id, c1;

FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or 
complex type containing map<>.
{code}
But in spark this scenario where the group by map column failed for this 
scenario where the map column is used in the select without any aggregation
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression 
spark_catalog.default.complex2.`c1` cannot be used as a grouping expression 
because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
 +- HiveTableRelation [`default`.`complex2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], 
Partition Cols: []]
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}
There is need to add the this scenario where grouping expression can have map 
type if aggregated expression does not have the that map type reference. This 
helps in migrating the user from hive to Spark.

After the code change 
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+                                                
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+
 {code}

  was:
Add the support in Spark for having group by map datatype column for the 
scenario that works in Hive.

In hive the below scenario works 
{code:java}
describe extended complex2;
OK
id                  string 
c1                  map<int, string>   
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, 
createTime:1627994412, lastAccessTime:0, retention:0, 
sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), 
FieldSchema(name:c1, type:map<int,string>, comment:null)], 
location:/user/hive/warehouse/complex2, 
inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null,
 serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1})

select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)

select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)

failed when map type is present in aggregated expression 
select id, max(c1), count(*) from complex2 group by id, c1; 
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or 
complex type containing map<>.
{code}
But in spark this scenario where the group by map column failed for this 
scenario where the map column is used in the select without any aggregation
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression 
spark_catalog.default.complex2.`c1` cannot be used as a grouping expression 
because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
 +- HiveTableRelation [`default`.`complex2`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], 
Partition Cols: []]
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}
There is need to add the this scenario where grouping expression can have map 
type if aggregated expression does not have the that map type reference. This 
helps in migrating the user from hive to Spark.

After the code change 
{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+                                                
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+
 {code}


> Add the support in Spark for having group by map datatype column for the 
> scenario that works in Hive
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36452
>                 URL: https://issues.apache.org/jira/browse/SPARK-36452
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Saurabh Chawla
>            Priority: Major
>
> Add the support in Spark for having group by map datatype column for the 
> scenario that works in Hive.
> In hive the below scenario works 
> {code:java}
> describe extended complex2;
> OK
> id                  string 
> c1                  map<int, string>   
> Detailed Table Information Table(tableName:complex2, dbName:default, 
> owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), 
> FieldSchema(name:c1, type:map<int,string>, comment:null)], 
> location:/user/hive/warehouse/complex2, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null,
>  serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1})
> select * from complex2;
> OK
> 1 {1:"u"}
> 2 {1:"u",2:"uo"}
> 1 {1:"u",2:"uo"}
> Time taken: 0.363 seconds, Fetched: 3 row(s)
> Working Scenario in Hive -: 
> select id, c1, count(*) from complex2 group by id, c1;
> OK
> 1 {1:"u"} 1
> 1 {1:"u",2:"uo"} 1
> 2 {1:"u",2:"uo"} 1
> Time taken: 1.621 seconds, Fetched: 3 row(s)
> Failed Scenario in Hive -: failed when map type is present in aggregated 
> expression 
> select id, max(c1), count(*) from complex2 group by id, c1;
> FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or 
> complex type containing map<>.
> {code}
> But in spark this scenario where the group by map column failed for this 
> scenario where the map column is used in the select without any aggregation
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> org.apache.spark.sql.AnalysisException: expression 
> spark_catalog.default.complex2.`c1` cannot be used as a grouping expression 
> because its data type map<int,string> is not an orderable data type.;
> Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
> +- SubqueryAlias spark_catalog.default.complex2
>  +- HiveTableRelation [`default`.`complex2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], 
> Partition Cols: []]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
> {code}
> There is need to add the this scenario where grouping expression can have map 
> type if aggregated expression does not have the that map type reference. This 
> helps in migrating the user from hive to Spark.
> After the code change 
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> +---+-----------------+--------+                                              
>   
> | id|               c1|count(1)|
> +---+-----------------+--------+
> |  1|         {1 -> u}|       1|
> |  2|{1 -> u, 2 -> uo}|       1|
> |  1|{1 -> u, 2 -> uo}|       1|
> +---+-----------------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

Reply via email to