rishvin commented on PR #2312: URL: https://github.com/apache/datafusion-comet/pull/2312#issuecomment-3283141180
> Thanks @rishvin what exactly spark sql function this PR addresses? I checked https://spark.apache.org/docs/latest/api/sql/search.html?q=map I dont see map_to_list > > Are you referring to https://spark.apache.org/docs/latest/api/sql/#map_entries ? If so this function already supported Hey @comphead : This is not spark sql function. We are using it internally to support grouping on map type. I know there has been some confusion around `MapSort` lately, hence providing the full context here regarding the original PR: https://github.com/apache/datafusion-comet/pull/2221. There is also subtle difference between `map_to_list` and `map_entries`, mentioned later. Spark4.0+ when doing grouping on `Map` type applies a logical rule to wrap the grouping column with `MapSort` expression. This expression class sorts the Map elements by keys, such that the output performed after evaluation is an ordered map. This ordered map can then be supplied for grouping because it will produce the correct result. For eg. these 2 map elements - `{a: 1, b: 2}` and `{b:2, a:1}`, regardless of ordering of keys in the map are essentially the same, hence should be part of the same group. The `MapSort` will sort them as `{a: 1, b: 2}` and `groupby` will put them in the same group `{a: 1, b: 2}`. Today, writing only the equivalent `MapSort` expression in Comet along will not work. This is because DataFusion does not directly support grouping on `Map` type. If we attempt to pass a `Map` type to DataFusion, it will throw with `Not implemented ...` error. However, DataFusion does support grouping on a more canonical form like `List`. So, if we change the `Map` type to `List` type, then Datafusion will work and give the correct result. To do so, I added this internal function `map_to_list` that will convert the `Map` type to the `List` type before passing it to the grouping expression. This translation from map to list type is done [here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L160), when building the grouping expressions. But now passing `map_to_list` to the grouping expression has one problem, the output grouping keys produced by the `HashAggregate` will be a `List` instead of `Map`. The result produced cannot be directly consumed by the downstream operators because they expect the grouping key type to be `Map`, however, we are passing them `List` type. We will hit schema-mismatch exception. So, we must convert the `List` type back to `Map` again. This translation is done by `map_from_list` internal function, which convert the output grouping keys back to `Map` type. This function is implemented in part3 [here](https://github.com/apache/datafusion-comet/pull/2328). And the conversion of `List` to `Map` type happens [here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L209) by applying a projection on top of `HashAggregate`. These conversation are totally done in native code, localized to `HashAggregate`. Since, we are maintain the schema requirements, the result produced by the `HashAggregate` can be consumed downstream Spark operators also. --- Regarding the second part of using `map_entries`: I checked the implementation in Datafusion, [here](https://github.com/rishvin/datafusion/blob/407a965d3740634d582376aa22c4b2b57da6f005/datafusion/functions-nested/src/map_entries.rs#L137-L137), which has one difference. It doesn't carry forward the field names. It is creating key field with name `key` and value field with name `value`. This means, if the original map has field names as `keys` and `values` (plural), they will not be retained when doing this conversion. Also, it does not carry forward `sorted` flag which is very specific to `Map` type. The `map_to_list` addresses this by cloning the incoming Map's field and piggybacking the `sorted` flag in the metadata hashmap of the field. This is being done [here](https://github.com/rishvin/datafusion-comet/blob/da1a397395b5deadc99408b6b8325f6cdef20bb9/native/spark-expr/src/map_funcs/map_to_list.rs#L72). And when converting the list back to map, the original field and metadata is consumed to construct the map type back. This is done [here](https://github.com/rishvin/datafusion-comet/blob/ded06b5c7548560761f21569035e7b4345fb3ca2/native/spark-expr/src/map_funcs/map_from_list.rs#L90). Retaining these metadata information is important, otherwise, we may run into schema mismatch issues. I ran into some schema related issue when implementing this. So, I think we may be able to use `map_entries` but we should retain the field and map specific metadata. Let me know if you have any further questions and your thoughts around this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org