douglas-raillard-arm opened a new issue, #14810:
URL: https://github.com/apache/arrow/issues/14810
### Describe the enhancement requested
I am currently working on a project with lots of `UnionArray`s coming from
Rust enums that I need to turn into a pandas dataframe using pyarrow.
To do so, I needed to extract `UnionArray.type_codes` in order to
display the enum variant names. Currently, `UnionArray.type_codes` does
not have a null mask and can have the value `0` in two circumstances:
1. When encoding the variant #0 (expected)
2. When the row is null.
This makes the value `0` ambiguous. I worked around it using this recipe:
```python
import pyarrow as pa
import pyarrow.compute as pc

# arr is a UnionArray
arr = ...

# Map type codes to the enum variant name
tags = pa.array([
    arr.type.field(i).name
    for i in arr.type.type_codes
])

# Child array associated with variant 0
first = pc.struct_field(arr, [0])
tag_array = pa.DictionaryArray.from_arrays(
    # Use a numpy view of the array, as the "mask" parameter is
    # currently unsupported for pyarrow arrays.
    arr.type_codes.to_numpy(),
    dictionary=tags,
    # type_code == 0 encodes both the first variant and
    # a lack of data. The only way to distinguish the two
    # is to look at the bool column associated with variant
    # 0 and check whether it is null.
    mask=pc.and_(
        first.is_null(),
        pc.equal(arr.type_codes, 0)
    ).to_pandas()
)
```
If `arr.type_codes` had an appropriate null mask, computing this by hand and
playing type-tetris with numpy/pandas would not be necessary.
Note that even if `type_codes` itself cannot be changed because it is simply
a zero-copy view on existing memory, it might still be convenient to introduce
a method to get the fixed-up version.
### Component(s)
Python