Re: [PR] feat: [1941-Part2]: Introduce map_to_list scalar function [datafusion-comet]

via GitHub Thu, 11 Sep 2025 18:17:19 -0700


rishvin commented on PR #2312:
URL: 
https://github.com/apache/datafusion-comet/pull/2312#issuecomment-3283141180

> Thanks @rishvin what exactly spark sql function this PR addresses? I
checked https://spark.apache.org/docs/latest/api/sql/search.html?q=map I dont
see map_to_list
>
> Are you referring to
https://spark.apache.org/docs/latest/api/sql/#map_entries ? If so this function
already supported

Hey @comphead : This is not spark sql function. We are using it internally
to support grouping on map type.

I know there has been some confusion around `MapSort` lately, hence
providing the full context here regarding the original PR:
https://github.com/apache/datafusion-comet/pull/2221. There is also subtle
difference between `map_to_list` and `map_entries`, mentioned later.

Spark4.0+ when doing grouping on `Map` type applies a logical rule to wrap
the grouping column with `MapSort` expression. This expression class sorts the
Map elements by keys, such that the output performed after evaluation is an
ordered map. This ordered map can then be supplied for grouping because it will
produce the correct result. For eg. these 2 map elements - `{a: 1, b: 2}` and
`{b:2, a:1}`, regardless of ordering of keys in the map are essentially the
same, hence should be part of the same group. The `MapSort` will sort them as
`{a: 1, b: 2}` and `groupby` will put them in the same group `{a: 1, b: 2}`.

Today, writing only the equivalent `MapSort` expression in Comet along will
not work. This is because DataFusion does not directly support grouping on
`Map` type. If we attempt to pass a `Map` type to DataFusion, it will throw
with `Not implemented ...` error. However, DataFusion does support grouping on
a more canonical form like `List`. So, if we change the `Map` type to `List`
type, then Datafusion will work and give the correct result. To do so, I added
this internal function `map_to_list` that will convert the `Map` type to the
`List` type before passing it to the grouping expression. This translation from
map to list type is done
[here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L160),
when building the grouping expressions.

But now passing `map_to_list` to the grouping expression has one problem,
the output grouping keys produced by the `HashAggregate` will be a `List`
instead of `Map`. The result produced cannot be directly consumed by the
downstream operators because they expect the grouping key type to be `Map`,
however, we are passing them `List` type. We will hit schema-mismatch
exception. So, we must convert the `List` type back to `Map` again. This
translation is done by `map_from_list` internal function, which convert the
output grouping keys back to `Map` type. This function is implemented in part3
[here](https://github.com/apache/datafusion-comet/pull/2328). And the
conversion of `List` to `Map` type happens
[here](https://github.com/rishvin/datafusion-comet/blob/4c580f829fae74c4b5a9b3feb277a32cd94937dc/native/core/src/execution/utils.rs#L209)
by applying a projection on top of `HashAggregate`.

These conversation are totally done in native code, localized to
`HashAggregate`. Since, we are maintain the schema requirements, the result
produced by the `HashAggregate` can be consumed downstream Spark operators also.

---
Regarding the second part of using `map_entries`:

I checked the implementation in Datafusion,
[here](https://github.com/rishvin/datafusion/blob/407a965d3740634d582376aa22c4b2b57da6f005/datafusion/functions-nested/src/map_entries.rs#L137-L137),
which has one difference. It doesn't carry forward the field names. It is
creating key field with name `key` and value field with name `value`. This
means, if the original map has field names as `keys` and `values` (plural),
they will not be retained when doing this conversion. Also, it does not carry
forward `sorted` flag which is very specific to `Map` type.

The `map_to_list` addresses this by cloning the incoming Map's field and
piggybacking the `sorted` flag in the metadata hashmap of the field. This is
being done
[here](https://github.com/rishvin/datafusion-comet/blob/da1a397395b5deadc99408b6b8325f6cdef20bb9/native/spark-expr/src/map_funcs/map_to_list.rs#L72).

And when converting the list back to map, the original field and metadata is
consumed to construct the map type back. This is done
[here](https://github.com/rishvin/datafusion-comet/blob/ded06b5c7548560761f21569035e7b4345fb3ca2/native/spark-expr/src/map_funcs/map_from_list.rs#L90).

Retaining these metadata information is important, otherwise, we may run
into schema mismatch issues. I ran into some schema related issue when
implementing this.

So, I think we may be able to use `map_entries` but we should retain the
field and map specific metadata.

Let me know if you have any further questions and your thoughts around this.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: [1941-Part2]: Introduce map_to_list scalar function [datafusion-comet]

Reply via email to