jorisvandenbossche opened a new issue, #14970: URL: https://github.com/apache/arrow/issues/14970
Related to https://github.com/apache/arrow/issues/14946 on the C++ side, and this recently came up in https://github.com/apache/arrow/pull/14781#issuecomment-1339114243.

A StructArray has child arrays that make up its "fields", but in addition it can also have a top-level validity bitmap. So when accessing a field of a StructArray that has such top-level nulls, you can retrieve the "raw" child array, or you can get the "logical" field array that combines the child array with the top-level bitmap.

To illustrate (with `pyarrow` imported as `pa`):

```
In [1]: arr = pa.StructArray.from_arrays(
   ...:     [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])],
   ...:     names=['a', 'b'],
   ...:     mask=pa.array([False, True, False, False, False]))

In [2]: arr.to_pandas()
Out[2]:
0    {'a': 5, 'b': 1}
1                None
2    {'a': 4, 'b': 3}
3    {'a': 2, 'b': 4}
4    {'a': 1, 'b': 5}
dtype: object

In [3]: arr.field('a')
Out[3]:
<pyarrow.lib.Int64Array object at 0x7f9db84cdd20>
[
  5,
  3,
  4,
  2,
  1
]

In [4]: arr.flatten()[0]
Out[4]:
<pyarrow.lib.Int64Array object at 0x7f9db855f400>
[
  5,
  null,
  4,
  2,
  1
]
```

Currently, the `field()` method on a StructArray gives you the raw child array, and there is a `flatten()` method that returns the "logical" field arrays for all the fields as a list of arrays.

We should have a method that returns the logical field array for a single field, instead of having to use `flatten()`. In https://github.com/apache/arrow/pull/14781, @amol- added a `_flattened_field` method (private for now, but we needed it to get the correct values to sort by):

```
In [5]: arr._flattened_field('a')
Out[5]:
<pyarrow.lib.Int64Array object at 0x7f9db85d9780>
[
  5,
  null,
  4,
  2,
  1
]
```

We could just make that a public method instead. However, I have some questions/concerns about this:

- I personally don't like the "flattened" term. I know we already use this in C++ as well (this basically just exposes the C++ `StructArray::GetFlattenedField`), but I don't find it very clear that it refers to this distinction.
- We could also change `field()` instead?
  I personally think this is what people typically want when they currently call `field()` (like @amol- was doing in the sort PR, to get the values of a certain field of the struct). The value in the raw child that is masked by the top-level bitmap is kind of an implementation detail, and IMO a user shouldn't get that so easily.
- If we change `field()` to default to the "flattened" field, we need an alternative way to access the raw child. We could add a keyword for this (but with what name?), or a separate method like `child()`.
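
For readers less familiar with the layout, the difference between the raw child and the "logical" field can be modelled in plain Python. This is only a sketch of the concept, not pyarrow's implementation: `ToyStructArray`, `raw_child` and `logical_field` are hypothetical names invented for illustration, and the data mirrors the `arr` example from this issue.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyStructArray:
    """Plain-Python stand-in for a struct array (hypothetical, not a pyarrow class)."""

    children: Dict[str, List[Optional[int]]]  # raw child values, one list per field
    validity: List[bool]                      # top-level validity; False marks a null struct

    def raw_child(self, name: str) -> List[Optional[int]]:
        # Return the child as stored, ignoring the top-level validity.
        return list(self.children[name])

    def logical_field(self, name: str) -> List[Optional[int]]:
        # Combine the child with the top-level validity: a slot is null
        # when the struct is null or the child value itself is null.
        return [
            value if valid else None
            for value, valid in zip(self.children[name], self.validity)
        ]


# In the pyarrow example the mask marks nulls with True,
# so the validity modelled here is its inverse.
mask = [False, True, False, False, False]
arr = ToyStructArray(
    children={"a": [5, 3, 4, 2, 1], "b": [1, 2, 3, 4, 5]},
    validity=[not m for m in mask],
)

print(arr.raw_child("a"))      # [5, 3, 4, 2, 1]
print(arr.logical_field("a"))  # [5, None, 4, 2, 1]
```

In this model, `raw_child` corresponds to what `field()` returns today, and `logical_field` to a single element of `flatten()`. The value `3` hidden behind the null struct is only visible through the raw child.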
