[jira] [Assigned] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

Apache Spark (Jira) Sun, 08 Aug 2021 06:28:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-36452:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add the support in Spark for having group by map datatype column for the 
> scenario that works in Hive
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36452
>                 URL: https://issues.apache.org/jira/browse/SPARK-36452
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Saurabh Chawla
>            Priority: Major
>
> Add the support in Spark for having group by map datatype column for the 
> scenario that works in Hive.
> In hive the below scenario works 
> {code:java}
> describe extended complex2;
> OK
> id                  string 
> c1                  map<int, string>   
> Detailed Table Information Table(tableName:complex2, dbName:default, 
> owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), 
> FieldSchema(name:c1, type:map<int,string>, comment:null)], 
> location:/user/hive/warehouse/complex2, 
> inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null,
>  serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1})
> select * from complex2;
> OK
> 1 {1:"u"}
> 2 {1:"u",2:"uo"}
> 1 {1:"u",2:"uo"}
> Time taken: 0.363 seconds, Fetched: 3 row(s)
> select id, c1, count(*) from complex2 group by id, c1;
> OK
> 1 {1:"u"} 1
> 1 {1:"u",2:"uo"} 1
> 2 {1:"u",2:"uo"} 1
> Time taken: 1.621 seconds, Fetched: 3 row(s)
> failed when map type is present in aggregated expression 
> select id, max(c1), count(*) from complex2 group by id, c1; 
> FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or 
> complex type containing map<>.
> {code}
> But in spark this scenario where the group by map column failed for this 
> scenario where the map column is used in the select without any aggregation
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> org.apache.spark.sql.AnalysisException: expression 
> spark_catalog.default.complex2.`c1` cannot be used as a grouping expression 
> because its data type map<int,string> is not an orderable data type.;
> Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
> +- SubqueryAlias spark_catalog.default.complex2
>  +- HiveTableRelation [`default`.`complex2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], 
> Partition Cols: []]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
> {code}
> There is need to add the this scenario where grouping expression can have map 
> type if aggregated expression does not have the that map type reference. This 
> helps in migrating the user from hive to Spark.
> After the code change 
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> +---+-----------------+--------+                                              
>   
> | id|               c1|count(1)|
> +---+-----------------+--------+
> |  1|         {1 -> u}|       1|
> |  2|{1 -> u, 2 -> uo}|       1|
> |  1|{1 -> u, 2 -> uo}|       1|
> +---+-----------------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Assigned] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

Reply via email to