jorisvandenbossche opened a new issue, #435:
URL: https://github.com/apache/arrow-nanoarrow/issues/435
When you have an Array object and want to inspect its details (buffers,
children, etc.), you currently have essentially 2x2 functions that each give
slightly different information: `c_array`, `c_array_view`, `c_schema`, and
`c_schema_view`.
In the tutorial material we are writing, we are mostly using `c_array_view`
as that gives the most comprehensive and understandable repr.
Example:
```python
In [1]: import nanoarrow as na
   ...: import pyarrow as pa
In [2]: pa_arr = pa.array([[0, 1], [None, 3], None, [4]])
In [3]: arr = na.Array(pa_arr)
In [4]: arr
Out[4]:
nanoarrow.Array<list<item: int64>>[4]
[0, 1]
[None, 3]
None
[4]
In [5]: na.c_array(arr)
Out[5]:
<nanoarrow.c_lib.CArray list<item: int64>>
- length: 4
- offset: 0
- null_count: 1
- buffers: (140041326198848, 140041326198784)
- dictionary: NULL
- children[1]:
  'item': <nanoarrow.c_lib.CArray int64>
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers: (140041326198912, 140041326198976)
    - dictionary: NULL
    - children[0]:
In [6]: na.c_array_view(arr)
Out[6]:
<nanoarrow.c_lib.CArrayView>
- storage_type: 'list'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 11010000>
  - data_offset <int32[20 b] 0 2 4 4 5>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int64'
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers[2]:
      - validity <bool[1 b] 11011000>
      - data <int64[40 b] 0 1 0 3 4>
    - dictionary: NULL
    - children[0]:
In [7]: na.c_schema(arr.schema)
Out[7]:
<nanoarrow.c_lib.CSchema list>
- format: '+l'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[1]:
  'item': <nanoarrow.c_lib.CSchema int64>
    - format: 'l'
    - name: 'item'
    - flags: 2
    - metadata: NULL
    - dictionary: NULL
    - children[0]:
In [8]: na.c_schema_view(arr.schema)
Out[8]:
<nanoarrow.c_lib.CSchemaView>
- type: 'list'
- storage_type: 'list'
- layout: <nanoarrow._lib.CLayout object at 0x7f5de1b38200>
- nullable: True
- storage_type_id: 26
- type_id: 26
In [19]: arr.schema
Out[19]: Schema(LIST)
```
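To illustrate why the named buffer previews are useful for teaching, here is a plain-Python sketch (no nanoarrow needed) that decodes the values copied from the `c_array_view` output above back into the original list. It assumes the bits in the repr are printed in element order (first character is element 0), which is consistent with the values shown; the helper names `decode_primitive` and `decode_list` are made up for this example.

```python
# Buffer previews copied from the c_array_view output above.
list_validity = "11010000"      # one bit per list element, padded to a full byte
list_offsets = [0, 2, 4, 4, 5]  # data_offset buffer (int32)
item_validity = "11011000"      # validity bitmap of the child 'item' array
item_data = [0, 1, 0, 3, 4]     # data buffer of the child (int64)


def decode_primitive(validity_bits, data, length):
    """Return the values, with None wherever the validity bit is 0."""
    return [data[i] if validity_bits[i] == "1" else None for i in range(length)]


def decode_list(validity_bits, offsets, child_values, length):
    """Slice the child values with consecutive offsets; null lists become None."""
    out = []
    for i in range(length):
        if validity_bits[i] == "1":
            out.append(child_values[offsets[i]:offsets[i + 1]])
        else:
            out.append(None)
    return out


items = decode_primitive(item_validity, item_data, length=5)
decoded = decode_list(list_validity, list_offsets, items, length=4)
print(decoded)  # [[0, 1], [None, 3], None, [4]]
```

Note that the null list at index 2 has equal start and end offsets (4 and 4), and that the child slot behind a null item still holds a placeholder `0` in the data buffer.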
Some observations:
- The CArrayView repr is of course the most useful (for our purpose) because
it shows a preview of the actual content of the buffers, and names the buffers
(validity, data, data_offset)
- But I like that the CArray repr still shows the original type (`list<item:
int64>` in this example). In comparison with CArrayView, it also shows the
names of the child arrays
- I find the "storage_type" a bit confusing, as I think it is not a generally
known concept in the Arrow format (except for extension types)
- For the schema, it is nice that the view translates the flags into
`nullable=True`, but it otherwise has less useful content (like the `layout`
entry and the `(storage_)type_id`, which I think is nanoarrow-specific?)
- Sidenote: should we make the main `Schema` repr more informative? (so that
it at least shows `list<item: int64>` instead of just `LIST`?)
Of course, many of those aspects are things we can easily change if we want
(like adding a better schema repr to certain outputs), but I first want to
gather some feedback on what we actually want. Also, I am currently looking
at this very much from an educational point of view, to explain the details of
the Arrow format (you might of course also want to use the above objects to
access certain information through their attributes in your code).
So one idea I had: specifically for the use case of inspecting the layout of
the data, we could have some kind of `inspect()` function or method that
prints some combination of the above (that would also hide the lower-level
details of CArray vs CArrayView for this use case).
Or, alternatively, maybe we could improve the CArrayView repr a little bit
and add a `view()` method on the Array to get it? (That would avoid having to
call `na.c_array_view(..)` in the tutorial.)
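To make the first idea a bit more concrete, here is a rough mock of what a combined output could look like: the full type and child names from the CArray repr, together with the named buffer previews from the CArrayView repr. This is only a stdlib sketch of a possible format; `ArrayInfo` and `inspect_array` are hypothetical names and not part of nanoarrow, and the values are typed in by hand from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ArrayInfo:
    """Hypothetical stand-in for the information CArray and CArrayView expose."""
    type_str: str    # full type, as shown in the CArray repr
    length: int
    null_count: int
    buffers: dict    # buffer name -> preview string, as shown by CArrayView
    name: str = ""   # child name, as shown in the CArray repr
    children: list = field(default_factory=list)


def inspect_array(info, indent=0):
    """Format one node and its children as an indented, repr-like string."""
    pad = " " * indent
    label = f"'{info.name}': " if info.name else ""
    lines = [f"{pad}{label}<{info.type_str}>[{info.length}], null_count={info.null_count}"]
    for buf_name, preview in info.buffers.items():
        lines.append(f"{pad}  - {buf_name}: {preview}")
    for child in info.children:
        lines.append(inspect_array(child, indent + 2))
    return "\n".join(lines)


# Values copied by hand from the c_array / c_array_view output in the example.
item = ArrayInfo("int64", 5, 1,
                 {"validity": "11011000", "data": "0 1 0 3 4"}, name="item")
arr_info = ArrayInfo("list<item: int64>", 4, 1,
                     {"validity": "11010000", "data_offset": "0 2 4 4 5"},
                     children=[item])
report = inspect_array(arr_info)
print(report)
```

A real implementation would presumably fill this in from `na.c_array(arr)` and `na.c_array_view(arr)` rather than from hand-written values.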