jorisvandenbossche opened a new issue, #435:
URL: https://github.com/apache/arrow-nanoarrow/issues/435

   When you have an Array object and want to inspect the details (buffers, 
children, etc.), you currently have essentially 2x2 functions that each give 
slightly different information: `c_array`, `c_array_view`, `c_schema`, and 
`c_schema_view`. 
   In the tutorial material we are writing, we are mostly using `c_array_view` 
as that gives the most comprehensive and understandable repr.
   
   Example:
   
   ```python
   In [1]: import nanoarrow as na, pyarrow as pa
   
   In [2]: pa_arr = pa.array([[0, 1], [None, 3], None, [4]])
   
   In [3]: arr = na.Array(pa_arr)
   
   In [4]: arr
   Out[4]: 
   nanoarrow.Array<list<item: int64>>[4]
   [0, 1]
   [None, 3]
   None
   [4]
   
   In [5]: na.c_array(arr)
   Out[5]: 
   <nanoarrow.c_lib.CArray list<item: int64>>
   - length: 4
   - offset: 0
   - null_count: 1
   - buffers: (140041326198848, 140041326198784)
   - dictionary: NULL
   - children[1]:
     'item': <nanoarrow.c_lib.CArray int64>
       - length: 5
       - offset: 0
       - null_count: 1
       - buffers: (140041326198912, 140041326198976)
       - dictionary: NULL
       - children[0]:
   
   In [6]: na.c_array_view(arr)
   Out[6]: 
   <nanoarrow.c_lib.CArrayView>
   - storage_type: 'list'
   - length: 4
   - offset: 0
   - null_count: 1
   - buffers[2]:
     - validity <bool[1 b] 11010000>
     - data_offset <int32[20 b] 0 2 4 4 5>
   - dictionary: NULL
   - children[1]:
     - <nanoarrow.c_lib.CArrayView>
       - storage_type: 'int64'
       - length: 5
       - offset: 0
       - null_count: 1
       - buffers[2]:
         - validity <bool[1 b] 11011000>
         - data <int64[40 b] 0 1 0 3 4>
       - dictionary: NULL
       - children[0]:
   
   In [7]: na.c_schema(arr.schema)
   Out[7]: 
   <nanoarrow.c_lib.CSchema list>
   - format: '+l'
   - name: ''
   - flags: 2
   - metadata: NULL
   - dictionary: NULL
   - children[1]:
     'item': <nanoarrow.c_lib.CSchema int64>
       - format: 'l'
       - name: 'item'
       - flags: 2
       - metadata: NULL
       - dictionary: NULL
       - children[0]:
   
   In [8]: na.c_schema_view(arr.schema)
   Out[8]: 
   <nanoarrow.c_lib.CSchemaView>
   - type: 'list'
   - storage_type: 'list'
   - layout: <nanoarrow._lib.CLayout object at 0x7f5de1b38200>
   - nullable: True
   - storage_type_id: 26
   - type_id: 26
   
   In [19]: arr.schema
   Out[19]: Schema(LIST)
   ```
   
   Some observations:
   
   - The CArrayView repr is of course the most useful (for our purpose) because 
it shows a preview of the actual content of the buffers, and names the buffers 
(validity, data, data_offset)
   - But I like that the CArray repr still shows the original type (`list<item: 
int64>` in this example). In contrast to CArrayView, it also shows the 
names of the child arrays
   - I find the "storage_type" a bit confusing, as I don't think it is a 
generally known concept in the Arrow format (except for extension types)
   - For the schema, it is nice that the view translates the flags into 
`nullable=True`, but it otherwise includes less useful content (like the 
`layout` entry, and the `(storage)_type_id`, which I think is nanoarrow specific?)
   - Sidenote: should we make the main `Schema` repr more informative? (to let 
it at least show `list<item: int64>` instead of just `LIST`?)
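
To make the observation about buffer names concrete, here is a minimal sketch in plain Python (no nanoarrow needed) of how the `validity`, `data_offset` and `data` buffers from the CArrayView repr above combine into the list values. The `decode_list` helper is hypothetical, and it assumes the repr prints validity bits in element order, padded to a full byte.

```python
def decode_list(validity_bits, offsets, child_validity_bits, child_values):
    """Rebuild list values from the buffers shown in the CArrayView repr.

    Hypothetical helper for illustration only; not part of nanoarrow.
    """
    # Apply the child validity bitmap to the child data buffer.
    # zip() stops at the last child value, so padding bits are ignored.
    child = [
        value if bit == "1" else None
        for bit, value in zip(child_validity_bits, child_values)
    ]
    result = []
    # N list elements are described by N + 1 offsets.
    for i in range(len(offsets) - 1):
        if validity_bits[i] == "1":
            # Element i spans child[offsets[i]:offsets[i + 1]].
            result.append(child[offsets[i]:offsets[i + 1]])
        else:
            result.append(None)
    return result


# Buffer contents copied from the `na.c_array_view(arr)` output above.
result = decode_list(
    validity_bits="11010000",
    offsets=[0, 2, 4, 4, 5],
    child_validity_bits="11011000",
    child_values=[0, 1, 0, 3, 4],
)
print(result)
```

The null list element (index 2) still occupies an offsets slot (`4 4`), and the null child value is stored as a placeholder `0` in the data buffer. That is the kind of detail the buffer preview makes visible.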
   
   Of course, many of those aspects are things we can easily change if we want 
(like adding a better schema repr to certain outputs), but I just want to first 
gather some feedback on what we actually want. Also, I am currently looking 
at it very much from an educational point of view, to explain the Arrow format 
details (you might of course also want to use the above objects to access 
certain information through the attributes in your code).
   
   So one idea I had: specifically for the use case of inspecting the layout of 
the data, we could have some kind of `inspect()` function or method that 
prints some combination of the above (that would also hide the lower-level 
details of CArray vs CArrayView for this use case). 
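
A rough sketch of what such a helper could look like, just to make the idea concrete. The name `inspect_layout` and its signature are hypothetical (nothing like this exists in nanoarrow today), and it takes the view functions as arguments so the sketch runs without nanoarrow installed.

```python
def inspect_layout(obj, views):
    """Combine the reprs of several views of `obj` into one report.

    Hypothetical helper: `views` is a list of (label, function) pairs,
    where each function takes `obj` and returns something with a repr.
    """
    sections = []
    for label, view in views:
        sections.append(f"[{label}]\n{view(obj)!r}")
    return "\n\n".join(sections)


# Stand-in demo with builtins so the sketch runs on its own.
demo = inspect_layout([1, 2, 3], [("length", len), ("sorted", sorted)])
print(demo)

# With nanoarrow, using only the functions shown in the example above:
#
#   inspect_layout(arr, [
#       ("c_array", na.c_array),
#       ("c_array_view", na.c_array_view),
#       ("c_schema", lambda a: na.c_schema(a.schema)),
#       ("c_schema_view", lambda a: na.c_schema_view(a.schema)),
#   ])
```

A real implementation would probably merge the information (type string, child names, buffer previews) rather than concatenating the four reprs, but that is exactly the part I would like feedback on.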
   
   Or, alternatively, maybe we could improve the CArrayView repr a little bit, 
and add a `view()` method on the Array to get it? (That would avoid having to 
call `na.c_array_view(..)` in the tutorial.)

