samuelcolvin opened a new issue, #6247: URL: https://github.com/apache/arrow-rs/issues/6247
Continuing from https://github.com/apache/arrow-rs/pull/6218#pullrequestreview-2236585058, I thought it worth creating a dedicated issue to discuss this before writing any more code.

Well `pyarrow` doesn't help much (or maybe it helps a lot by giving us flexibility!) All four cases fail:

```
Unsupported cast to sparse_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast to dense_union<int_field: int32=0, string_field: string=1> from int32
Unsupported cast from sparse_union<0: int32=0, 1: string=1> to int32 using function cast_int32
Unsupported cast from dense_union<0: int64=0, 1: bool=1> to int32 using function cast_int32
```

<details>
<summary>Python Code</summary>

```py
import pyarrow as pa

int_array = pa.array([1, 2, 3, 4, 5], type=pa.int32())
union_fields = [
    pa.field('int_field', pa.int32()),
    pa.field('string_field', pa.string())
]

try:
    print(int_array.cast(pa.union(union_fields, mode='sparse')))
except Exception as e:
    print(e)
else:
    print('success')

try:
    print(int_array.cast(pa.union(union_fields, mode='dense')))
except Exception as e:
    print(e)
else:
    print('success')

sparse_indices = pa.array([0, 1, 0, 1, 0], type=pa.int8())
sparse_children = [
    pa.array([1, None, 3, None, None], type=pa.int32()),
    pa.array([None, 'a', None, 'b', None], type=pa.string()),
]

sparse_union_array = pa.UnionArray.from_sparse(sparse_indices, sparse_children)
# print(sparse_union_array)

try:
    print(sparse_union_array.cast(pa.int32()))
except Exception as e:
    print(e)
else:
    print('success')

dense_types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
dense_offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
dense_children = [
    pa.array([5, 6, 7]),
    pa.array([False, True]),
]

dense_union_array = pa.UnionArray.from_dense(dense_types, dense_offsets, dense_children)
# print(dense_union_array)

try:
    print(dense_union_array.cast(pa.int32()))
except Exception as e:
    print(e)
else:
    print('success')
```

</details>

---

Here's my proposal for what we support and don't support
(yet):

### Casting to sparse and dense union

We choose the most appropriate child to cast to using the current logic: choose the exact matching type, otherwise the first type you can cast to, left to right.

I think this is fairly simple, uncontroversial and already implemented in #6218.

### Casting from sparse and dense unions

I think we can support both sparse and dense using either `zip`, `interleave` or `take`. Any suggestion on which will be fastest is much appreciated.

We can do this either by:
1. requiring one or more fields to be castable to the output type, and just casting those children, leaving values associated with other children `null`
2. or, requiring all fields to be castable

I think @alamb suggested he'd prefer 2. I started implementing 1. in #6218 so that we can use this union cast logic for [`datafusion-functions-json`](https://github.com/datafusion-contrib/datafusion-functions-json), to match postgres behaviour.

When the user queries:

```sql
select count(*) from foo where (thing->'field')::int=4
```

The value returned from `thing->'field'` is a `JsonUnion`, hence I need that to be cast to an int even though that union includes stuff like string, object and array that can't be cast to an int. (I'm trying to roughly match PostgreSQL where `select ('{"foo": 123}'::jsonb->'foo')::int` is valid)

If we go with route 2. above, this expression would raise an error.

**Note**: for the above case of `(thing->'field')::int`, we already do an optimisation pass where we convert `json_get_union(thing, 'field')::int` to `json_get_int(thing, 'field')` and therefore avoid this problem. My reason for implementing casting from unions in the first place was to support expressions where a `JsonUnion` is compared to values but the optimisation won't or can't work, e.g. if `thing->'field'` is in a CTE, then used later.

I guess if we decide that route 2.
is correct, I have a few options:
* I might be able to use query rewriting to rewrite all cases of casting from `JsonUnion`, e.g. replace all casts in the query with a UDF that does something custom for `JsonUnion`
* I could wait for logical types https://github.com/apache/datafusion/issues/11513 and use them to control casting?
* We could introduce config on a union to control casting behaviour, but that seems like an extension of arrow and therefore unlikely to happen
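
To make the proposal above easier to discuss, here is a rough plain-Python sketch of the child selection when casting to a union, and of route 1 versus route 2 when casting from a sparse union to an int. This is not the arrow-rs code from #6218: the names `pick_child`, `cast_union_lenient`, `cast_union_strict` and the `INT_CASTS` table are hypothetical, the union is modelled with plain lists rather than Arrow arrays, and which child types count as "castable" is an assumption made only for this example.

```python
# Assumed castability table for this sketch only: child type name -> cast to int.
# A real implementation would ask the cast kernel instead.
INT_CASTS = {"int32": int, "bool": int}


def pick_child(source_type, child_types, can_cast):
    """Casting *to* a union: exact match first, else first castable child, left to right."""
    if source_type in child_types:
        return child_types.index(source_type)
    for i, child in enumerate(child_types):
        if can_cast(source_type, child):
            return i
    raise TypeError(f"cannot cast {source_type} to any union child")


def cast_union_lenient(type_ids, children):
    """Route 1: at least one child must be castable; rows of other children become null."""
    if not any(name in INT_CASTS for name, _ in children):
        raise TypeError("no union child can be cast to int")
    out = []
    for row, type_id in enumerate(type_ids):
        name, values = children[type_id]
        cast = INT_CASTS.get(name)
        value = values[row]  # sparse layout: every child has one slot per row
        out.append(cast(value) if cast is not None and value is not None else None)
    return out


def cast_union_strict(type_ids, children):
    """Route 2: every child must be castable, otherwise the whole cast errors."""
    bad = [name for name, _ in children if name not in INT_CASTS]
    if bad:
        raise TypeError(f"union children not castable to int: {bad}")
    return cast_union_lenient(type_ids, children)


# Same shape as the sparse union in the pyarrow snippet above.
type_ids = [0, 1, 0, 1, 0]
children = [
    ("int32", [1, None, 3, None, None]),
    ("string", [None, "a", None, "b", None]),
]

print(cast_union_lenient(type_ids, children))  # [1, None, 3, None, None]

try:
    cast_union_strict(type_ids, children)
except TypeError as e:
    print(e)  # route 2 rejects the union because of the string child
```

With route 1 the `(thing->'field')::int` style of query keeps working on a union that also holds strings, objects and arrays. With route 2 the same cast errors unless every child is castable.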