[GitHub] [arrow] douglas-raillard-arm opened a new issue, #14810: UnionArray.type_codes is never null

GitBox Thu, 01 Dec 2022 08:27:25 -0800


douglas-raillard-arm opened a new issue, #14810:
URL: https://github.com/apache/arrow/issues/14810


   ### Describe the enhancement requested
   
   I am working on a project with many `UnionArray` values generated from 
Rust enums, which I need to turn into a pandas DataFrame using pyarrow. 
   
   To do so, I needed to extract `UnionArray.type_codes` in order to 
display the enum variant names. Currently, `UnionArray.type_codes` does 
not have a null mask and can have the value `0` in two circumstances:
   1. When encoding the variant #0 (expected)
   2. When the row is null.
   
   This makes the value 0 ambiguous. I worked around it using this recipe:
   ```python
   
   import pyarrow as pa
   import pyarrow.compute as pc
   
   # arr is a UnionArray
   arr = ...
   
   # Map type codes to the enum variant name
   tags = pa.array([
        arr.type.field(i).name
        for i in arr.type.type_codes
   ])
   
   # Child array of variant 0. This assumes the original `struct_field`
   # is `pyarrow.compute.struct_field`.
   first = pc.struct_field(arr, [0])
   
   tag_array = pa.DictionaryArray.from_arrays(
        # Pass the indices as a numpy array because the "mask" parameter
        # is currently unsupported for pyarrow arrays.
        arr.type_codes.to_numpy(),
        dictionary=tags,
        # type_code == 0 encodes both the first variant and the
        # lack of data. The only way to distinguish them is to
        # look at the bool column associated with variant 0 and
        # check whether it is null.
        mask=pc.and_(
                first.is_null(),
                pc.equal(arr.type_codes, 0)
        ).to_pandas()
   )
   ```
   
   If `arr.type_codes` had an appropriate null mask, computing this by hand and 
playing type-tetris with numpy/pandas would not be necessary.
   
   Note that even if `type_codes` itself cannot be changed because it's simply 
a zero-copy view on existing memory, it might still be convenient to introduce 
a method to get the fixed up version.
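
To make the requested semantics concrete, here is a minimal sketch in plain Python, with lists standing in for `arr.type_codes` and for the null flags of the variant 0 child. The name `masked_type_codes` is hypothetical and not part of pyarrow. It also carries over the assumption made in the recipe above: a row with type code 0 whose variant 0 child is null is treated as a null row, so a genuine variant 0 row with a null payload would be masked too.

```python
from typing import List, Optional, Sequence


def masked_type_codes(
    type_codes: Sequence[int],
    first_child_is_null: Sequence[bool],
) -> List[Optional[int]]:
    """Return the type codes with None wherever the row is treated as null.

    Hypothetical helper: a row counts as null when its type code is 0
    and the child array of variant 0 is null at that position.
    """
    if len(type_codes) != len(first_child_is_null):
        raise ValueError("inputs must have the same length")
    return [
        None if code == 0 and child_null else code
        for code, child_null in zip(type_codes, first_child_is_null)
    ]


# Row 0: variant 0 with a value; row 1: variant 1; row 2: null row.
codes = [0, 1, 0]
# The variant 0 child is null wherever the row holds another variant
# or no data at all.
first_is_null = [False, True, True]

fixed = masked_type_codes(codes, first_is_null)
print(fixed)  # [0, 1, None]
```

With such a result, mapping type codes to variant names would no longer need a separately computed mask.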
   
   ### Component(s)
   
   Python


