jorisvandenbossche opened a new issue, #14970:
URL: https://github.com/apache/arrow/issues/14970

   Related to https://github.com/apache/arrow/issues/14946 on the C++ side, and 
this recently came up in 
https://github.com/apache/arrow/pull/14781#issuecomment-1339114243.
   
   A StructArray has child arrays that make up its "fields", but in addition it 
can also have a top-level validity bitmap. So when accessing a field of a 
StructArray that has such top-level nulls, you can retrieve the "raw" child 
array or you can get the "logical" field array that combines the child array 
with the top-level bitmap. 
   
   To illustrate:
   
   ```
   In [1]: arr = pa.StructArray.from_arrays(
      ...:     [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])],
      ...:     names=['a', 'b'],
      ...:     mask=pa.array([False, True, False, False, False]),
      ...: )
   
   In [2]: arr.to_pandas()
   Out[2]: 
   0    {'a': 5, 'b': 1}
   1                None
   2    {'a': 4, 'b': 3}
   3    {'a': 2, 'b': 4}
   4    {'a': 1, 'b': 5}
   dtype: object
   
   In [3]: arr.field('a')
   Out[3]: 
   <pyarrow.lib.Int64Array object at 0x7f9db84cdd20>
   [
     5,
     3,
     4,
     2,
     1
   ]
   
   In [4]: arr.flatten()[0]
   Out[4]: 
   <pyarrow.lib.Int64Array object at 0x7f9db855f400>
   [
     5,
     null,
     4,
     2,
     1
   ]
   ```
   
   Currently, the `field()` method on a StructArray gives you the raw child 
array, and there is a `flatten()` method that returns those "logical" field 
arrays for all the fields as a list of arrays. 
   We should have a method that returns the "logical" array for a single 
field, so that you don't have to call `flatten()` and index into the result. 
In https://github.com/apache/arrow/pull/14781, @amol- added a 
`_flattened_field` method for this (private for now, but we needed it to get 
the correct values to sort by):
   
   ```
   In [5]: arr._flattened_field('a')
   Out[5]: 
   <pyarrow.lib.Int64Array object at 0x7f9db85d9780>
   [
     5,
     null,
     4,
     2,
     1
   ]
   ```
   
   We could just make that a public method instead. However, I have some 
questions/concerns about this:
   
   - I personally don't like the "flattened" term. I know we already use it 
in C++ as well (this basically just exposes the C++ 
`StructArray::GetFlattenedField`), but I don't think the name makes the 
raw-versus-logical distinction clear. 
   - We could also change `field()` instead? I personally think the logical 
array is what people typically want when they currently call `field` (like 
@amol- was doing in the sort PR, to get the values of a certain field of the 
struct). The value in the raw child that is masked by the top-level bitmap is 
kind of an implementation detail, and IMO a user shouldn't get it so easily.
   - If we change `field()` to default to the "flattened" field, we need an 
alternative way to access the raw child. We could add a keyword for this? (but 
what name?) Or a separate method like `child()`?
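
To make the two options in the last bullet concrete, here is a pure-Python toy model (not pyarrow code). The class `ToyStructArray` and the keyword name `raw` are placeholders invented for this sketch; the keyword name in particular is exactly the open question above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyStructArray:
    """Toy stand-in for a struct array; hypothetical, not a pyarrow class."""

    children: Dict[str, List[Optional[int]]]
    validity: List[bool]  # True means the struct row is valid (not null)

    def field(self, name: str, raw: bool = False) -> List[Optional[int]]:
        # Option A: logical values by default, keyword to opt into the raw child.
        values = self.children[name]
        if raw:
            return list(values)
        return [v if ok else None for v, ok in zip(values, self.validity)]

    def child(self, name: str) -> List[Optional[int]]:
        # Option B: a separate accessor that always returns the raw child.
        return list(self.children[name])


toy = ToyStructArray(
    children={"a": [5, 3, 4, 2, 1], "b": [1, 2, 3, 4, 5]},
    validity=[True, False, True, True, True],
)

logical_a = toy.field("a")          # top-level null applied at index 1
raw_a = toy.field("a", raw=True)    # raw child via the keyword
raw_b = toy.child("b")              # raw child via the separate method
```

Either shape keeps the raw child reachable; the difference is only whether it is reached through a keyword on `field()` or through a second method.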

