samuelcolvin opened a new issue, #6247:
URL: https://github.com/apache/arrow-rs/issues/6247

   Continuing from 
https://github.com/apache/arrow-rs/pull/6218#pullrequestreview-2236585058 — I 
thought it worth creating a dedicated issue to discuss this before writing any 
more code.
   
   Well `pyarrow` doesn't help much (or maybe it helps a lot by giving us 
flexibility!)
   
   All four cases fail:
   
   ```
   Unsupported cast to sparse_union<int_field: int32=0, string_field: string=1> 
from int32
   Unsupported cast to dense_union<int_field: int32=0, string_field: string=1> 
from int32
   Unsupported cast from sparse_union<0: int32=0, 1: string=1> to int32 using 
function cast_int32
   Unsupported cast from dense_union<0: int64=0, 1: bool=1> to int32 using 
function cast_int32
   ```
   
   <details>
   <summary>Python Code</summary>
   
   ```py
   import pyarrow as pa
   
   int_array = pa.array([1, 2, 3, 4, 5], type=pa.int32())
   
   union_fields = [
       pa.field('int_field', pa.int32()),
       pa.field('string_field', pa.string())
   ]
   try:
       print(int_array.cast(pa.union(union_fields, mode='sparse')))
   except Exception as e:
       print(e)
   else:
       print('success')
   try:
       print(int_array.cast(pa.union(union_fields, mode='dense')))
   except Exception as e:
       print(e)
   else:
       print('success')
   
   
   sparse_indices = pa.array([0, 1, 0, 1, 0], type=pa.int8())
   sparse_children = [
       pa.array([1, None, 3, None, None], type=pa.int32()),
       pa.array([None, 'a', None, 'b', None], type=pa.string()),
   ]
   sparse_union_array = pa.UnionArray.from_sparse(sparse_indices, 
sparse_children)
   # print(sparse_union_array)
   
   try:
       print(sparse_union_array.cast(pa.int32()))
   except Exception as e:
       print(e)
   else:
       print('success')
   
   dense_types = pa.array([0, 1, 1, 0, 0], type=pa.int8())
   dense_offsets = pa.array([0, 0, 1, 1, 2], type=pa.int32())
   dense_children = [
       pa.array([5, 6, 7]),
       pa.array([False, True]),
   ]
   dense_union_array = pa.UnionArray.from_dense(dense_types, dense_offsets, 
dense_children)
   # print(dense_union_array)
   try:
       print(dense_union_array.cast(pa.int32()))
   except Exception as e:
       print(e)
   else:
       print('success')
   ```
   
   </details>
   
   ---
   
   Here's my proposal for what we support and don't support (yet):
   
   ### Casting to sparse and dense union
   
   We choose the most appropriate child to cast to using the current logic - 
choose the exact matching type, otherwise the first type you can cast to, left 
to right.
   
   I think this is fairly simple, uncontroversial and already implemented in 
#6218.
   
   ### Casting from sparse and dense unions
   
   I think we can support both sparse and dense using either `zip`, 
`interleave` or `take` — any suggestion on which will be fastest much 
appreciated.
   
   We can do this, either:
   1. requiring one or more fields to be castable to the output type, and just 
casting those children, leaving values associated with other children `null`
   2. or, requiring all fields to be castable
   
   I think @alamb suggested he'd prefer 2., I started implementing 1. in #6218 
— this is so we can use this union cast logic for 
[`datafusion-functions-json`](https://github.com/datafusion-contrib/datafusion-functions-json),
 to match postgres behaviour.
   
   When the user queries:
   
   ```sql
   select count(*) from foo where (thing->'field')::int=4
   ```
   The value returned from `thing->'field'` is a `JsonUnion`, hence I need that 
to be cast to an int even though that union includes stuff like string, object 
and array that can't be cast to an int.
   
   (I'm trying to roughly match PostgreSQL where `select ('{"foo": 
123}'::jsonb->'foo')::int` is valid)
   
   If we go with route 2. above, this expression would raise an error.
   
   **Note**: for the above case of `(thing->'field')::int`, we already do an 
optimisation pass where we convert `json_get_union(thing, 'field')::int` to 
`json_get_int(thing, 'field')` and therefore avoid this problem. My reason for 
implementing casting from unions in the first place was to support expression 
where `JsonUnion` is compared to values, but the optimization won't or can't 
work, e.g. if `thing->'field'` is in a CTE, then used later.
   
   I guess if we decide that route 2. is correct, I have a few options:
   * I might able to use query rewriting to rewrite all cases of casting from 
`JsonUnion`, e.g. replace all casts in the query with a UDF that does something 
custom for `JsonUnion`
   * I could wait for logical types 
https://github.com/apache/datafusion/issues/11513 and use them to control 
casting?
   * We could introduce config on a union to control casting behaviour, that 
seems like an extension of arrow and therefore unlikely to happen


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to