rishvin commented on PR #2312:
URL: 
https://github.com/apache/datafusion-comet/pull/2312#issuecomment-3283141180

   > Thanks @rishvin what exactly spark sql function this PR addresses? I 
checked https://spark.apache.org/docs/latest/api/sql/search.html?q=map I dont 
see map_to_list
   > 
   > Are you referring to 
https://spark.apache.org/docs/latest/api/sql/#map_entries ? If so this function 
already supported
   
   Hey @comphead : This is not spark sql function. We are using it internally 
to support grouping on map type. 
   
   I know there has been some confusion around `MapSort` lately, hence 
providing the full context here regarding the original PR: 
https://github.com/apache/datafusion-comet/pull/2221. There is also subtle 
difference between `map_to_list` and `map_entries`, mentioned later.
   
   Spark4.0+ when doing grouping on `Map` type applies a logical rule to wrap 
the grouping column with `MapSort` expression. This expression class sorts the 
Map elements by keys, such that the output performed after evaluation is an 
ordered map. This ordered map can then be supplied for grouping because it will 
produce the correct result. For eg. these 2 map elements - `{a: 1, b: 2}` and 
`{b:2, a:1}`, regardless of ordering of keys in the map are essentially the 
same, hence should be part of the same group. The `MapSort` will sort them as 
`{a: 1, b: 2}` and  `groupby` will put them in the same group `{a: 1, b: 2}`.
   
   Today, writing only the equivalent `MapSort` expression in Comet along will 
not work. This is because DataFusion does not directly support grouping on 
`Map` type. If we attempt to pass a `Map` type to DataFusion, it will throw 
with `Not implemented ...` error. However, DataFusion does support grouping on 
a more canonical form like `List`. So, if we change the `Map` type to `List` 
type, then Datafusion will  work and give the correct result. To do so, I added 
this internal function `map_to_list` that will convert the `Map` type to the 
`List` type before passing it to the grouping expression. This translation from 
map to list type is done 
[here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L160),
 when building the grouping expressions.
   
   But now passing `map_to_list` to the grouping expression has one problem, 
the output grouping keys produced by the `HashAggregate` will be a `List` 
instead of `Map`. The result produced cannot be directly consumed by the 
downstream operators because they expect the grouping key type to be `Map`, 
however, we are passing them `List` type. We will hit schema-mismatch 
exception. So, we must convert the `List` type back to `Map` again. This 
translation is done by `map_from_list` internal function, which convert the 
output grouping keys back to `Map` type. This function is implemented in part3 
[here](https://github.com/apache/datafusion-comet/pull/2328). And the 
conversion of `List` to `Map` type happens 
[here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L209)
 by applying a projection on top of `HashAggregate`.
   
   
   These conversation are  totally done in native code, localized to 
`HashAggregate`. Since, we are maintain the schema requirements, the result 
produced by the `HashAggregate` can be consumed downstream Spark operators also.
   
   --- 
   Regarding the second part of using `map_entries`:
   
   I checked the implementation in Datafusion, 
[here](https://github.com/rishvin/datafusion/blob/407a965d3740634d582376aa22c4b2b57da6f005/datafusion/functions-nested/src/map_entries.rs#L137-L137),
 which has one difference. It doesn't carry forward the field names. It is 
creating key field with name `key` and value field with name `value`. This 
means, if the original map has field names as `keys` and `values` (plural), 
they will not be retained when doing this conversion. Also, it does not carry 
forward `sorted` flag which is very specific to `Map` type. 
   
   The `map_to_list` addresses this by cloning the incoming Map's field and 
piggybacking the `sorted` flag in the metadata hashmap of the field. This is 
being done 
[here](https://github.com/rishvin/datafusion-comet/blob/da1a397395b5deadc99408b6b8325f6cdef20bb9/native/spark-expr/src/map_funcs/map_to_list.rs#L72).
   
   And when converting the list back to map, the original field and metadata is 
consumed to construct the map type back. This is done 
[here](https://github.com/rishvin/datafusion-comet/blob/ded06b5c7548560761f21569035e7b4345fb3ca2/native/spark-expr/src/map_funcs/map_from_list.rs#L90).
   
   Retaining these metadata information is important, otherwise, we may run 
into schema mismatch issues. I ran into some schema related issue when 
implementing this.
   
   So, I think we may be able to use `map_entries` but we should retain the 
field and map specific metadata.
   
   Let me know if you have any further questions and your thoughts around this.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to