Re: [PR] refactor(python): Document, prefix, and add reprs for C-wrapping classes [arrow-nanoarrow]

via GitHub Wed, 10 Jan 2024 06:46:31 -0800


jorisvandenbossche commented on code in PR #340:
URL: https://github.com/apache/arrow-nanoarrow/pull/340#discussion_r1447398459



##########
.isort.cfg:
##########
@@ -0,0 +1,23 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[settings]

Review Comment:
   Is the reason we can't put this in `pyproject.toml` that it isn't at the top level?



##########
python/README.md:
##########
@@ -43,97 +43,129 @@ If you can import the namespace, you're good to go!
 import nanoarrow as na
 ```
 
-## Example
+## Low-level C library bindings
 
-The Arrow C Data and Arrow C Stream interfaces are comprised of three 
structures: the `ArrowSchema` which represents a data type of an array, the 
`ArrowArray` which represents the values of an array, and an 
`ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common 
`ArrowSchema`. All three can be wrapped by Python objects using the nanoarrow 
Python package.
+The Arrow C Data and Arrow C Stream interfaces are comprised of three 
structures: the `ArrowSchema` which represents a data type of an array, the 
`ArrowArray` which represents the values of an array, and an 
`ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common 
`ArrowSchema`.
 
 ### Schemas
 
-Use `nanoarrow.schema()` to convert a data type-like object to an 
`ArrowSchema`. This is currently only implemented for pyarrow objects.
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap 
it as a Python object. This works for any object implementing the [Arrow 
PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) 
(e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
 
 
 ```python
 import pyarrow as pa
-schema = na.schema(pa.decimal128(10, 3))
+schema = na.c_schema(pa.decimal128(10, 3))
+schema
 ```
 
-You can extract the fields of a `Schema` object one at a time or parse it into 
a view to extract deserialized parameters.
+
+
+
+    <nanoarrow.c_lib.CSchema decimal128(10, 3)>
+    - format: 'd:10,3'
+    - name: ''
+    - flags: 2
+    - metadata: NULL
+    - dictionary: NULL
+    - children[0]:
+
+
+
+You can extract the fields of a `CSchema` object one at a time or parse it 
into a view to extract deserialized parameters.
 
 
 ```python
-print(schema.format)
-print(schema.view().decimal_precision)
-print(schema.view().decimal_scale)
+na.c_schema_view(schema)
 ```
 
-    d:10,3
-    10
-    3
 
 
-The `nanoarrow.schema()` helper is currently only implemented for pyarrow 
objects. If your data type has an `_export_to_c()`-like function, you can get 
the address of a freshly-allocated `ArrowSchema` as well:
+
+    <nanoarrow.c_lib.CSchemaView>
+    - type: 'decimal128'
+    - storage_type: 'decimal128'
+    - decimal_bitwidth: 128
+    - decimal_precision: 10
+    - decimal_scale: 3
+
+
+
+Advanced users can allocate an empty `CSchema` and populate its contents by 
passing its `._addr()` to a schema-exporting function.
 
 
 ```python
-schema = na.Schema.allocate()
+schema = na.c_schema()

Review Comment:
   Personally, I find the previous way (`na.Schema.allocate()`) more explicit.



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -76,116 +81,96 @@ cdef void pycapsule_array_deleter(object array_capsule) 
noexcept:
     if array.release != NULL:
         ArrowArrayRelease(array)
 
-    free(array)
+    ArrowFree(array)
 
 
 cdef object alloc_c_array(ArrowArray** c_array) noexcept:
-    c_array[0] = <ArrowArray*> malloc(sizeof(ArrowArray))
+    c_array[0] = <ArrowArray*> ArrowMalloc(sizeof(ArrowArray))
     # Ensure the capsule destructor doesn't call a random release pointer
     c_array[0].release = NULL
     return PyCapsule_New(c_array[0], 'arrow_array', &pycapsule_array_deleter)
 
 
-cdef void pycapsule_stream_deleter(object stream_capsule) noexcept:
+cdef void pycapsule_array_stream_deleter(object stream_capsule) noexcept:
     cdef ArrowArrayStream* stream = <ArrowArrayStream*>PyCapsule_GetPointer(
         stream_capsule, 'arrow_array_stream'
     )
     # Do not invoke the deleter on a used/moved capsule
     if stream.release != NULL:
         ArrowArrayStreamRelease(stream)
 
-    free(stream)
+    ArrowFree(stream)
 
 
-cdef object alloc_c_stream(ArrowArrayStream** c_stream) noexcept:
-    c_stream[0] = <ArrowArrayStream*> malloc(sizeof(ArrowArrayStream))
+cdef object alloc_c_array_stream(ArrowArrayStream** c_stream) noexcept:
+    c_stream[0] = <ArrowArrayStream*> ArrowMalloc(sizeof(ArrowArrayStream))
     # Ensure the capsule destructor doesn't call a random release pointer
     c_stream[0].release = NULL
-    return PyCapsule_New(c_stream[0], 'arrow_array_stream', 
&pycapsule_stream_deleter)
-
-
-cdef void arrow_array_release(ArrowArray* array) noexcept with gil:
-    Py_DECREF(<object>array.private_data)
-    array.private_data = NULL
-    array.release = NULL
-
-
-cdef class SchemaHolder:
-    """Memory holder for an ArrowSchema
-
-    This class is responsible for the lifecycle of the ArrowSchema
-    whose memory it is responsible for. When this object is deleted,
-    a non-NULL release callback is invoked.
-    """
-    cdef ArrowSchema c_schema
-
-    def __cinit__(self):
-        self.c_schema.release = NULL
+    return PyCapsule_New(c_stream[0], 'arrow_array_stream', 
&pycapsule_array_stream_deleter)
 
-    def __dealloc__(self):
-        if self.c_schema.release != NULL:
-          ArrowSchemaRelease(&self.c_schema)
 
-    def _addr(self):
-        return <uintptr_t>&self.c_schema
+cdef void pycapsule_device_array_deleter(object device_array_capsule) noexcept:
+    cdef ArrowDeviceArray* device_array = 
<ArrowDeviceArray*>PyCapsule_GetPointer(
+        device_array_capsule, 'arrow_device_array'
+    )
+    # Do not invoke the deleter on a used/moved capsule
+    if device_array.array.release != NULL:
+        device_array.array.release(&device_array.array)
 
+    ArrowFree(device_array)
 
-cdef class ArrayHolder:
-    """Memory holder for an ArrowArray
 
-    This class is responsible for the lifecycle of the ArrowArray
-    whose memory it is responsible. When this object is deleted,
-    a non-NULL release callback is invoked.
-    """
-    cdef ArrowArray c_array
+cdef object alloc_c_device_array(ArrowDeviceArray** c_device_array) noexcept:
+    c_device_array[0] = <ArrowDeviceArray*> 
ArrowMalloc(sizeof(ArrowDeviceArray))
+    # Ensure the capsule destructor doesn't call a random release pointer
+    c_device_array[0].array.release = NULL
+    return PyCapsule_New(c_device_array[0], 'arrow_device_array', 
&pycapsule_device_array_deleter)
 
-    def __cinit__(self):
-        self.c_array.release = NULL
 
-    def __dealloc__(self):
-        if self.c_array.release != NULL:
-          ArrowArrayRelease(&self.c_array)
+cdef void pycapsule_array_view_deleter(object array_capsule) noexcept:
+    cdef ArrowArrayView* array_view = <ArrowArrayView*>PyCapsule_GetPointer(
+        array_capsule, 'nanoarrow_array_view'
+    )
 
-    def _addr(self):
-        return <uintptr_t>&self.c_array
+    ArrowArrayViewReset(array_view)
 
-cdef class ArrayStreamHolder:
-    """Memory holder for an ArrowArrayStream
+    ArrowFree(array_view)
 
-    This class is responsible for the lifecycle of the ArrowArrayStream
-    whose memory it is responsible. When this object is deleted,
-    a non-NULL release callback is invoked.
-    """
-    cdef ArrowArrayStream c_array_stream
 
-    def __cinit__(self):
-        self.c_array_stream.release = NULL
+cdef object alloc_c_array_view(ArrowArrayView** c_array_view) noexcept:
+    c_array_view[0] = <ArrowArrayView*> ArrowMalloc(sizeof(ArrowArrayView))
+    ArrowArrayViewInitFromType(c_array_view[0], NANOARROW_TYPE_UNINITIALIZED)
+    return PyCapsule_New(c_array_view[0], 'nanoarrow_array_view', 
&pycapsule_array_view_deleter)
 
-    def __dealloc__(self):
-        if self.c_array_stream.release != NULL:
-            ArrowArrayStreamRelease(&self.c_array_stream)
 
-    def _addr(self):
-        return <uintptr_t>&self.c_array_stream
+# To more safely implement export of an ArrowArray whose address may be

Review Comment:
   FWIW you can also add this as a normal docstring to the function
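
   As a plain-Python illustration of the difference (the function names
   below are hypothetical and are not taken from `_lib.pyx`): a comment above
   a function is dropped at runtime, while a docstring stays attached to the
   function object.

```python
# A comment placed above a function is invisible to help() and __doc__
def export_with_comment():
    return None


def export_with_docstring():
    """Explain here why the export is implemented this way."""
    return None


print(export_with_comment.__doc__)
print(export_with_docstring.__doc__)
```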



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -352,57 +326,45 @@ cdef class Schema:
     def metadata(self):
         self._assert_valid()
         if self._ptr.metadata != NULL:
-            return SchemaMetadata(self, <uintptr_t>self._ptr.metadata)
+            return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata)
         else:
             return None
 
     @property
-    def children(self):
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
         self._assert_valid()
-        return SchemaChildren(self)
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+
+        return CSchema(self._base, <uintptr_t>self._ptr.children[i])
+
+    @property
+    def children(self):
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Schema(self, <uintptr_t>self._ptr.dictionary)
+            return CSchema(self, <uintptr_t>self._ptr.dictionary)
         else:
             return None
 
-    def view(self):

Review Comment:
   Could we still keep this method for convenience, so that you don't have to 
pass your schema object to two different functions?



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -352,57 +326,45 @@ cdef class Schema:
     def metadata(self):
         self._assert_valid()
         if self._ptr.metadata != NULL:
-            return SchemaMetadata(self, <uintptr_t>self._ptr.metadata)
+            return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata)
         else:
             return None
 
     @property
-    def children(self):
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
         self._assert_valid()
-        return SchemaChildren(self)
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+
+        return CSchema(self._base, <uintptr_t>self._ptr.children[i])
+
+    @property
+    def children(self):
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Schema(self, <uintptr_t>self._ptr.dictionary)
+            return CSchema(self, <uintptr_t>self._ptr.dictionary)
         else:
             return None
 
-    def view(self):

Review Comment:
   One reason this would be useful is that the SchemaView doesn't give you 
access to the children (right? That is maybe also something that could be 
changed). So if you want a view of a child of a schema, you need 
something like `na.c_schema_view(na.c_schema(schema_obj).child(0))`?
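
   A toy sketch of the convenience being discussed. `ToySchema`,
   `ToySchemaView` and `toy_schema_view()` are hypothetical stand-ins, not
   the real `CSchema`, `CSchemaView` or `c_schema_view()`; the point is only
   that a `view()` method can delegate to the free function.

```python
class ToySchemaView:
    """Hypothetical stand-in for a parsed view of a schema."""

    def __init__(self, schema):
        # Map a few Arrow format strings to type names
        self.type = {"+l": "list", "g": "double", "l": "int64"}[schema.format]


def toy_schema_view(schema):
    """Hypothetical stand-in for a free function like c_schema_view()."""
    return ToySchemaView(schema)


class ToySchema:
    """Hypothetical stand-in for a schema wrapper with children."""

    def __init__(self, format, children=()):
        self.format = format
        self._children = list(children)

    def child(self, i):
        return self._children[i]

    def view(self):
        # Convenience method: delegate to the free function
        return toy_schema_view(self)


list_schema = ToySchema("+l", [ToySchema("g")])

# Free-function spelling versus the method spelling
via_function = toy_schema_view(list_schema.child(0)).type
via_method = list_schema.child(0).view().type
print(via_function, via_method)
```

   Both spellings return the same view; the method form avoids nesting two
   function calls around the schema object.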



##########
python/src/nanoarrow/_lib_utils.py:
##########
@@ -74,17 +79,134 @@ def array_repr(array, indent=0):
     else:
         lines.append(f"{indent_str}- dictionary: NULL")
 
-    children = array.children
-    lines.append(f"{indent_str}- children[{len(children)}]:")
-    for child in children:
+    lines.append(f"{indent_str}- children[{array.n_children}]:")
+    for child in array.children:
         child_repr = array_repr(child, indent=indent + 4)
         lines.append(f"{indent_str}  {repr(child.schema.name)}: {child_repr}")
 
     return "\n".join(lines)
 
 
+def schema_view_repr(schema_view):
+    lines = [
+        "<nanoarrow.c_lib.CSchemaView>",
+        f"- type: {repr(schema_view.type)}",
+        f"- storage_type: {repr(schema_view.storage_type)}",
+    ]
+
+    for attr_name in sorted(dir(schema_view)):
+        if attr_name.startswith("_") or attr_name in ("type", "storage_type"):
+            continue
+
+        attr_value = getattr(schema_view, attr_name)
+        if attr_value is None:
+            continue
+
+        lines.append(f"- {attr_name}: {repr(attr_value)}")
+
+    return "\n".join(lines)

Review Comment:
   Do we want to show something about the children here?
   
   Because right now for example for a list type, the schema view repr is less 
informative than the main schema repr:
   
   ```
   In [68]: schema
   Out[68]: 
   a: int64
   b: list<item: double>
     child 0, item: double
   
   In [69]: na.c_schema(schema).child(1)
   Out[69]: 
   <nanoarrow.c_lib.CSchema list>
   - format: '+l'
   - name: 'b'
   - flags: 2
   - metadata: NULL
   - dictionary: NULL
   - children[1]:
     'item': <nanoarrow.c_lib.CSchema double>
       - format: 'g'
       - name: 'item'
       - flags: 2
       - metadata: NULL
       - dictionary: NULL
       - children[0]:
   
   In [70]: na.c_schema_view(na.c_schema(schema).child(1))
   Out[70]: 
   <nanoarrow.c_lib.CSchemaView>
   - type: 'list'
   - storage_type: 'list'
   ```
   
   So the schema view repr doesn't say what type of list it is (just "list")
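
   One possible shape for such a repr, sketched with toy objects. The
   `toy_view_repr()` helper and the `children`/`name` attributes on the view
   are assumptions for illustration; the `CSchemaView` in this PR does not
   expose children.

```python
from types import SimpleNamespace


def toy_view_repr(view, indent=0):
    """Hypothetical repr that also lists the types of child views."""
    pad = " " * indent
    lines = [f"{pad}- type: {view.type!r}"]
    lines.append(f"{pad}- children[{len(view.children)}]:")
    for child in view.children:
        # Recurse so that nested types show up under their parent
        lines.append(f"{pad}  {child.name!r}:")
        lines.append(toy_view_repr(child, indent + 4))
    return "\n".join(lines)


item = SimpleNamespace(name="item", type="double", children=[])
b = SimpleNamespace(name="b", type="list", children=[item])
print(toy_view_repr(b))
```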
   



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -890,50 +790,129 @@ cdef class BufferView:
         self._element_size_bits = element_size_bits
         self._strides = self._item_size()
         self._shape = self._ptr.size_bytes // self._strides
+        self._format[0] = 0
+        self._populate_format()
+
+    def _addr(self):
+        return <uintptr_t>self._ptr.data.data
 
+    @property
+    def device_type(self):
+        return self._device.device_type
+
+    @property
+    def device_id(self):
+        return self._device.device_id
+
+    @property
+    def element_size_bits(self):
+        return self._element_size_bits
+
+    @property
+    def size_bytes(self):
+        return self._ptr.size_bytes
+
+    @property
+    def type(self):
+        if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY:
+            return "validity"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID:
+            return "type_id"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET:
+            return "union_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET:
+            return "data_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA:
+            return "data"
+
+    @property
+    def data_type(self):
+        return ArrowTypeString(self._buffer_data_type).decode("UTF-8")
+
+    @property
+    def format(self):
+        return self._format.decode("UTF-8")
+
+    @property
+    def item_size(self):
+        return self._strides
+
+    def __len__(self):
+        return self._shape
+
+    def __getitem__(self, int64_t i):
+        if i < 0 or i >= self._shape:
+            raise IndexError(f"Index {i} out of range")
+        cdef int64_t offset = self._strides * i
+        value = unpack_from(self.format, buffer=self, offset=offset)
+        if len(value) == 1:
+            return value[0]
+        else:
+            return value
+
+    def __iter__(self):
+        for value in iter_unpack(self.format, self):
+            if len(value) == 1:
+                yield value[0]
+            else:
+                yield value

Review Comment:
   Hmm, it seems that this doesn't work with the endianness "=" you added below 
to the format type of the buffer protocol
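
   A standard-library sketch of general Python behavior that may be related.
   It uses a `ctypes` array as a stand-in exporter whose format string carries
   a byte-order prefix; `BufferView` itself is not exercised, so whether this
   is the failure seen here is an assumption.

```python
import ctypes
import struct

# Stand-in exporter: ctypes arrays report a format string with an explicit
# byte-order prefix ("<i" on little-endian machines, ">i" on big-endian)
ints = (ctypes.c_int * 3)(1, 2, 3)
view = memoryview(ints)
fmt = view.format

# memoryview can only decode items for native single-character formats
try:
    view.tolist()
    tolist_error = None
except NotImplementedError as exc:
    tolist_error = exc

# The struct module accepts the prefixed format without complaint
values = [item[0] for item in struct.iter_unpack(fmt, bytes(ints))]

print(fmt, type(tolist_error).__name__, values)
```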



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -352,57 +326,45 @@ cdef class Schema:
     def metadata(self):
         self._assert_valid()
         if self._ptr.metadata != NULL:
-            return SchemaMetadata(self, <uintptr_t>self._ptr.metadata)
+            return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata)
         else:
             return None
 
     @property
-    def children(self):
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
         self._assert_valid()
-        return SchemaChildren(self)
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+
+        return CSchema(self._base, <uintptr_t>self._ptr.children[i])
+
+    @property
+    def children(self):
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Schema(self, <uintptr_t>self._ptr.dictionary)
+            return CSchema(self, <uintptr_t>self._ptr.dictionary)
         else:
             return None
 
-    def view(self):
-        self._assert_valid()
-        schema_view = SchemaView()
-        cdef Error error = Error()
-        cdef int result = ArrowSchemaViewInit(&schema_view._schema_view, 
self._ptr, &error.c_error)
-        if result != NANOARROW_OK:
-            error.raise_message("ArrowSchemaViewInit()", result)
 
-        return schema_view
+cdef class CSchemaView:
+    """Low-level ArrowSchemaView wrapper
 
+    This object is a literal wrapper around a read-only ArrowSchema. It 
provides field accessors
+    that return Python objects and handles structure lifecycle.
 
-cdef class SchemaView:
-    """ArrowSchemaView wrapper
-
-    The ArrowSchemaView is a nanoarrow C library structure that facilitates
-    access to the deserialized content of an ArrowSchema (e.g., parameter
-    values for parameterized types). This wrapper extends that facility to 
Python.
-
-    Examples
-    --------
-
-    >>> import pyarrow as pa
-    >>> import nanoarrow as na
-    >>> schema = na.schema(pa.decimal128(10, 3))
-    >>> schema_view = schema.view()
-    >>> schema_view.type
-    'decimal128'
-    >>> schema_view.decimal_bitwidth
-    128
-    >>> schema_view.decimal_precision
-    10
-    >>> schema_view.decimal_scale
-    3
+    See `nanoarrow.c_schema_view()` for construction and usage examples.
     """
+    cdef object _base

Review Comment:
   ```suggestion
       cdef CSchema _base
   ```
   
   ? (And if that is correct, it could maybe also use a more explicit name.)



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -352,57 +326,45 @@ cdef class Schema:
     def metadata(self):
         self._assert_valid()
         if self._ptr.metadata != NULL:
-            return SchemaMetadata(self, <uintptr_t>self._ptr.metadata)
+            return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata)
         else:
             return None
 
     @property
-    def children(self):
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
         self._assert_valid()
-        return SchemaChildren(self)
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+
+        return CSchema(self._base, <uintptr_t>self._ptr.children[i])
+
+    @property
+    def children(self):
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Schema(self, <uintptr_t>self._ptr.dictionary)
+            return CSchema(self, <uintptr_t>self._ptr.dictionary)
         else:
             return None
 
-    def view(self):
-        self._assert_valid()
-        schema_view = SchemaView()
-        cdef Error error = Error()
-        cdef int result = ArrowSchemaViewInit(&schema_view._schema_view, 
self._ptr, &error.c_error)
-        if result != NANOARROW_OK:
-            error.raise_message("ArrowSchemaViewInit()", result)
 
-        return schema_view
+cdef class CSchemaView:
+    """Low-level ArrowSchemaView wrapper
 
+    This object is a literal wrapper around a read-only ArrowSchema. It 
provides field accessors
+    that return Python objects and handles structure lifecycle.
 
-cdef class SchemaView:
-    """ArrowSchemaView wrapper
-
-    The ArrowSchemaView is a nanoarrow C library structure that facilitates
-    access to the deserialized content of an ArrowSchema (e.g., parameter

Review Comment:
   I would keep this content in the new docstring, as it's still useful to 
explain the difference between CSchema and CSchemaView (for a user of the 
Python library not familiar with the nanoarrow C details).



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -804,64 +761,6 @@ cdef class SchemaMetadata:
             yield key_obj, value_obj
 
 
-cdef class ArrayChildren:

Review Comment:
   Nice to see those Children classes removed! ;)



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -630,65 +574,71 @@ cdef class Array:
 
     @property
     def null_count(self):
+        self._assert_valid()
         return self._ptr.null_count
 
+    @property
+    def n_buffers(self):
+        self._assert_valid()
+        return self._ptr.n_buffers
+
     @property
     def buffers(self):
+        self._assert_valid()
         return tuple(<uintptr_t>self._ptr.buffers[i] for i in 
range(self._ptr.n_buffers))
 
+    @property
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
+        self._assert_valid()
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+        return CArray(self._base, <uintptr_t>self._ptr.children[i], 
self._schema.child(i))
+
     @property
     def children(self):
-        return ArrayChildren(self)
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Array(self, <uintptr_t>self._ptr.dictionary, 
self._schema.dictionary)
+            return CArray(self, <uintptr_t>self._ptr.dictionary, 
self._schema.dictionary)
         else:
             return None
 
     def __repr__(self):
-        return array_repr(self)
-
-
-cdef class ArrayView:
-    """ArrowArrayView wrapper
-
-    The ArrowArrayView is a nanoarrow C library structure that provides
-    structured access to buffers addresses, buffer sizes, and buffer
-    data types. The buffer data is usually propagated from an ArrowArray
-    but can also be propagated from other types of objects (e.g., serialized
-    IPC). The offset and length of this view are independent of its parent
-    (i.e., this object can also represent a slice of its parent).

Review Comment:
   Same comment here about the docstring



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -233,39 +218,23 @@ cdef class Error:
         raise NanoarrowException(what, code, "")
 
 
-cdef class Schema:
-    """ArrowSchema wrapper
-
-    This class provides a user-facing interface to access the fields of
-    an ArrowSchema as defined in the Arrow C Data interface. These objects
-    are usually created using `nanoarrow.schema()`. This Python wrapper
-    allows access to schema fields but does not automatically deserialize
-    their content: use `.view()` to validate and deserialize the content
-    into a more easily inspectable object.
-
-    Examples
-    --------
-
-    >>> import pyarrow as pa
-    >>> import nanoarrow as na
-    >>> schema = na.schema(pa.int32())
-    >>> schema.is_valid()
-    True
-    >>> schema.format
-    'i'
-    >>> schema.name
-    ''
-    >>> schema_view = schema.view()
-    >>> schema_view.type
-    'int32'
+cdef class CSchema:
+    """Low-level ArrowSchema wrapper
+
+    This object is a literal wrapper around a read-only ArrowSchema. It 
provides field accessors
+    that return Python objects and handles the C Data interface lifecycle 
(i.e., initialized
+    ArrowSchema structures are always released).
+
+    See `nanoarrow.c_schema()` for construction and usage examples.
     """
     cdef object _base

Review Comment:
   Is this `_base` now always a capsule? If so, maybe add a comment saying 
that.



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -24,45 +24,50 @@ This Cython extension provides low-level Python wrappers 
around the
 Arrow C Data and Arrow C Stream interface structs. In general, there
 is one wrapper per C struct and pointer validity is managed by keeping
 strong references to Python objects. These wrappers are intended to
-be literal and stay close to the structure definitions.
+be literal and stay close to the structure definitions: higher level
+interfaces can and should be built in Python where it is faster to
+iterate and where it is easier to create a better user experience
+by default (i.e., classes, methods, and functions implemented in Python
+generally have better autocomplete + documentation available to IDEs).
 """
 
 from libc.stdint cimport uintptr_t, int64_t
-from libc.stdlib cimport malloc, free
 from libc.string cimport memcpy
-from cpython.mem cimport PyMem_Malloc, PyMem_Free
+from libc.stdio cimport snprintf
 from cpython.bytes cimport PyBytes_FromStringAndSize
-from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer, 
PyCapsule_CheckExact
+from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer
 from cpython cimport Py_buffer
-from cpython.ref cimport PyObject, Py_INCREF, Py_DECREF
+from cpython.ref cimport Py_INCREF, Py_DECREF
 from nanoarrow_c cimport *
 from nanoarrow_device_c cimport *
 
-from nanoarrow._lib_utils import array_repr, device_array_repr, schema_repr, 
device_repr
+from struct import unpack_from, iter_unpack
+from nanoarrow import _lib_utils
 
 def c_version():
     """Return the nanoarrow C library version string
     """
     return ArrowNanoarrowVersion().decode("UTF-8")
 
 
+# PyCapsule utilities
 #
-# PyCapsule export utilities
-#
-
-
+# PyCapsules are used (1) to safely manage memory associated with C structures
+# by initializing them and ensuring the appropriate cleanup is invoked when
+# the object is deleted; and (2) as an export mechanism conforming to the
+# Arrow PyCapsule interface for the objects where this is defined.
 cdef void pycapsule_schema_deleter(object schema_capsule) noexcept:
     cdef ArrowSchema* schema = <ArrowSchema*>PyCapsule_GetPointer(
         schema_capsule, 'arrow_schema'
     )
     if schema.release != NULL:
         ArrowSchemaRelease(schema)
 
-    free(schema)
+    ArrowFree(schema)

Review Comment:
   For my education: is there a benefit in using the nanoarrow version?



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -630,65 +574,71 @@ cdef class Array:
 
     @property
     def null_count(self):
+        self._assert_valid()
         return self._ptr.null_count
 
+    @property
+    def n_buffers(self):
+        self._assert_valid()
+        return self._ptr.n_buffers
+
     @property
     def buffers(self):
+        self._assert_valid()
         return tuple(<uintptr_t>self._ptr.buffers[i] for i in 
range(self._ptr.n_buffers))
 
+    @property
+    def n_children(self):
+        self._assert_valid()
+        return self._ptr.n_children
+
+    def child(self, int64_t i):
+        self._assert_valid()
+        if i < 0 or i >= self._ptr.n_children:
+            raise IndexError(f"{i} out of range [0, {self._ptr.n_children})")
+        return CArray(self._base, <uintptr_t>self._ptr.children[i], 
self._schema.child(i))
+
     @property
     def children(self):
-        return ArrayChildren(self)
+        for i in range(self.n_children):
+            yield self.child(i)
 
     @property
     def dictionary(self):
         self._assert_valid()
         if self._ptr.dictionary != NULL:
-            return Array(self, <uintptr_t>self._ptr.dictionary, 
self._schema.dictionary)
+            return CArray(self, <uintptr_t>self._ptr.dictionary, 
self._schema.dictionary)
         else:
             return None
 
     def __repr__(self):
-        return array_repr(self)
-
-
-cdef class ArrayView:
-    """ArrowArrayView wrapper
-
-    The ArrowArrayView is a nanoarrow C library structure that provides
-    structured access to buffers addresses, buffer sizes, and buffer
-    data types. The buffer data is usually propagated from an ArrowArray
-    but can also be propagated from other types of objects (e.g., serialized
-    IPC). The offset and length of this view are independent of its parent
-    (i.e., this object can also represent a slice of its parent).
-
-    Examples
-    --------
-
-    >>> import pyarrow as pa
-    >>> import numpy as np
-    >>> import nanoarrow as na
-    >>> array = na.array(pa.array(["one", "two", "three", None]))
-    >>> array_view = na.array_view(array)
-    >>> np.array(array_view.buffers[1])
-    array([ 0,  3,  6, 11, 11], dtype=int32)
-    >>> np.array(array_view.buffers[2])
-    array([b'o', b'n', b'e', b't', b'w', b'o', b't', b'h', b'r', b'e', b'e'],
-          dtype='|S1')
+        return _lib_utils.array_repr(self)
+
+
+cdef class CArrayView:
+    """Low-level ArrowArrayView wrapper
+
+    This object is a literal wrapper around an ArrowArrayView. It provides 
field accessors
+    that return Python objects and handles the structure lifecycle (i.e., 
initialized
+    ArrowArrayView structures are always released).
+
+    See `nanoarrow.c_array_view()` for construction and usage examples.
     """
     cdef object _base
     cdef ArrowArrayView* _ptr
     cdef ArrowDevice* _device
-    cdef Schema _schema
-    cdef object _base_buffer
 
-    def __cinit__(self, object base, uintptr_t addr, Schema schema, object 
base_buffer):
+    def __cinit__(self, object base, uintptr_t addr):
         self._base = base
         self._ptr = <ArrowArrayView*>addr
-        self._schema = schema
-        self._base_buffer = base_buffer
         self._device = ArrowDeviceCpu()
 
+    @property
+    def storage_type(self):

Review Comment:
   I see that `storage_type` already existed in the SchemaView before, but what 
exactly is the difference between it and `type`?



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -947,88 +926,28 @@ cdef class BufferView:
     def __releasebuffer__(self, Py_buffer *buffer):
         pass
 
+    def __repr__(self):
+        return _lib_utils.buffer_view_repr(self)

Review Comment:
   It might be nice to include a name here as well for the standalone repr (the 
util function only gives you the content, which is useful for including it into 
another repr).
   Something like 
   
   ```suggestion
           return f"nanoarrow.c_lib.BufferView 
{_lib_utils.buffer_view_repr(self)[1:]}"
   ```
   
   (The slicing is because it already starts with a `<`; that could also be 
changed in the util function.)
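
   A toy sketch of the slicing idea. `toy_buffer_view_repr()` and its output
   format are hypothetical, not the real `_lib_utils.buffer_view_repr()`, and
   the sketch puts a `<` back in front of the class name, which is an
   assumption about the intended result.

```python
def toy_buffer_view_repr(values):
    """Hypothetical helper that returns only the bracketed content."""
    return "<int32 data[12 b] " + " ".join(str(v) for v in values) + ">"


class ToyBufferView:
    def __init__(self, values):
        self.values = values

    def __repr__(self):
        # Drop the helper's leading "<" and put the class name in front
        inner = toy_buffer_view_repr(self.values)[1:]
        return f"<nanoarrow.c_lib.BufferView {inner}"


print(repr(ToyBufferView([1, 2, 3])))
```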



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -890,50 +790,129 @@ cdef class BufferView:
         self._element_size_bits = element_size_bits
         self._strides = self._item_size()
         self._shape = self._ptr.size_bytes // self._strides
+        self._format[0] = 0
+        self._populate_format()
+
+    def _addr(self):
+        return <uintptr_t>self._ptr.data.data
 
+    @property
+    def device_type(self):
+        return self._device.device_type
+
+    @property
+    def device_id(self):
+        return self._device.device_id
+
+    @property
+    def element_size_bits(self):
+        return self._element_size_bits
+
+    @property
+    def size_bytes(self):
+        return self._ptr.size_bytes
+
+    @property
+    def type(self):
+        if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY:
+            return "validity"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID:
+            return "type_id"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET:
+            return "union_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET:
+            return "data_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA:
+            return "data"
+
+    @property
+    def data_type(self):
+        return ArrowTypeString(self._buffer_data_type).decode("UTF-8")
+
+    @property
+    def format(self):
+        return self._format.decode("UTF-8")
+
+    @property
+    def item_size(self):
+        return self._strides
+
+    def __len__(self):
+        return self._shape
+
+    def __getitem__(self, int64_t i):
+        if i < 0 or i >= self._shape:
+            raise IndexError(f"Index {i} out of range")
+        cdef int64_t offset = self._strides * i
+        value = unpack_from(self.format, buffer=self, offset=offset)
+        if len(value) == 1:
+            return value[0]
+        else:
+            return value
+
+    def __iter__(self):
+        for value in iter_unpack(self.format, self):
+            if len(value) == 1:
+                yield value[0]
+            else:
+                yield value

Review Comment:
   A Python memoryview object supports this kind of indexing, as well as conversion to a Python list (https://docs.python.org/3/library/stdtypes.html#memoryview.tolist). So a potential alternative is to reuse that (`memoryview(self).tolist()` might work out of the box).
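
A small sketch of what I mean, using `array.array` as a stand-in for any object that exports the buffer protocol (it is not the nanoarrow `BufferView`). One caveat I have not verified against this PR: as far as I know, memoryview indexing and `tolist()` are only documented for native single-character formats, so the `=`-prefixed formats used here may need extra handling.

```python
import array

# Stand-in for a buffer-protocol exporter holding four C ints
buf = array.array("i", [10, 20, 30, 40])
view = memoryview(buf)

# Element access, comparable to what BufferView.__getitem__ does by hand
first = view[0]

# Conversion to a Python list, comparable to iterating with iter_unpack
values = view.tolist()

print(view.format, first, values)
```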



##########
.isort.cfg:
##########
@@ -0,0 +1,23 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[settings]

Review Comment:
   (By the way, this is for another PR, but I would also switch to ruff for linting, since it includes the functionality of isort.)



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -890,50 +790,129 @@ cdef class BufferView:
         self._element_size_bits = element_size_bits
         self._strides = self._item_size()
         self._shape = self._ptr.size_bytes // self._strides
+        self._format[0] = 0
+        self._populate_format()
+
+    def _addr(self):
+        return <uintptr_t>self._ptr.data.data
 
+    @property
+    def device_type(self):
+        return self._device.device_type
+
+    @property
+    def device_id(self):
+        return self._device.device_id
+
+    @property
+    def element_size_bits(self):
+        return self._element_size_bits
+
+    @property
+    def size_bytes(self):
+        return self._ptr.size_bytes
+
+    @property
+    def type(self):
+        if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY:
+            return "validity"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID:
+            return "type_id"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET:
+            return "union_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET:
+            return "data_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA:
+            return "data"
+
+    @property
+    def data_type(self):
+        return ArrowTypeString(self._buffer_data_type).decode("UTF-8")
+
+    @property
+    def format(self):
+        return self._format.decode("UTF-8")
+
+    @property
+    def item_size(self):
+        return self._strides
+
+    def __len__(self):
+        return self._shape
+
+    def __getitem__(self, int64_t i):
+        if i < 0 or i >= self._shape:
+            raise IndexError(f"Index {i} out of range")
+        cdef int64_t offset = self._strides * i
+        value = unpack_from(self.format, buffer=self, offset=offset)
+        if len(value) == 1:
+            return value[0]
+        else:
+            return value
+
+    def __iter__(self):
+        for value in iter_unpack(self.format, self):
+            if len(value) == 1:
+                yield value[0]
+            else:
+                yield value
 
     cdef Py_ssize_t _item_size(self):
-        if self._buffer_data_type == NANOARROW_TYPE_BOOL:
-            return 1
-        elif self._buffer_data_type == NANOARROW_TYPE_STRING:
-            return 1
-        elif self._buffer_data_type == NANOARROW_TYPE_BINARY:
+        if self._element_size_bits < 8:
             return 1
         else:
             return self._element_size_bits // 8
 
-    cdef const char* _get_format(self):
-        if self._buffer_data_type == NANOARROW_TYPE_INT8:
-            return "b"
+    cdef void _populate_format(self):
+        cdef const char* format_const = NULL
+        if self._element_size_bits == 0:
+            # Variable-size elements (e.g., data buffer for string or binary) export as
+            # one byte per element (character if string, unspecified binary otherwise)
+            if self._buffer_data_type == NANOARROW_TYPE_STRING:
+                format_const = "c"
+            else:
+                format_const = "B"
+        elif self._element_size_bits < 8:
+            # Bitmaps export as unspecified binary
+            format_const = "B"
+        elif self._buffer_data_type == NANOARROW_TYPE_INT8:
+            format_const = "b"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT8:
-            return "B"
+            format_const = "B"
         elif self._buffer_data_type == NANOARROW_TYPE_INT16:
-            return "h"
+            format_const = "=h"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT16:
-            return "H"
+            format_const = "=H"
         elif self._buffer_data_type == NANOARROW_TYPE_INT32:
-            return "i"
+            format_const = "=i"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT32:
-            return "I"
+            format_const = "=I"
         elif self._buffer_data_type == NANOARROW_TYPE_INT64:
-            return "l"
+            format_const = "=q"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT64:
-            return "L"
+            format_const = "=Q"
+        elif self._buffer_data_type == NANOARROW_TYPE_HALF_FLOAT:
+            format_const = "=e"
         elif self._buffer_data_type == NANOARROW_TYPE_FLOAT:
-            return "f"
+            format_const = "=f"
         elif self._buffer_data_type == NANOARROW_TYPE_DOUBLE:
-            return "d"
-        elif self._buffer_data_type == NANOARROW_TYPE_STRING:
-            return "c"
+            format_const = "=d"
+        elif self._buffer_data_type == NANOARROW_TYPE_INTERVAL_DAY_TIME:
+            format_const = "=ii"
+        elif self._buffer_data_type == NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO:
+            format_const = "=iiq"
+
+        if format_const != NULL:
+            snprintf(self._format, sizeof(self._format), "%s", format_const)
         else:
-            return "B"
+            snprintf(self._format, sizeof(self._format), "%ds", self._element_size_bits // 8)

Review Comment:
   Why is this needed (compared to just returning the string as was done before)?
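
From reading the diff, the only case that cannot be a constant string seems to be the fallback branch, which formats the element width in bytes into the string. A rough Python sketch of that branch, where `fallback_format` is an illustrative name and not part of nanoarrow:

```python
import struct


def fallback_format(element_size_bits):
    # Mirrors the final else branch of the diff: the byte width becomes
    # part of the format string, e.g. 128 bits -> "16s".
    return f"{element_size_bits // 8}s"


# The constant formats in the diff use "=" (standard sizes), so their
# widths do not depend on the platform.
fixed_sizes = {fmt: struct.calcsize(fmt) for fmt in ("=h", "=i", "=q", "=ii", "=iiq")}

fmt = fallback_format(128)
print(fmt, struct.calcsize(fmt), fixed_sizes)
```

If that is the only reason, it might be worth a short comment in the code.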



##########
python/README.md:
##########
@@ -43,97 +43,129 @@ If you can import the namespace, you're good to go!
 import nanoarrow as na
 ```
 
-## Example
+## Low-level C library bindings
 
-The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. All three can be wrapped by Python objects using the nanoarrow Python package.
+The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.
 
 ### Schemas
 
-Use `nanoarrow.schema()` to convert a data type-like object to an `ArrowSchema`. This is currently only implemented for pyarrow objects.
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
 
 
 ```python
 import pyarrow as pa
-schema = na.schema(pa.decimal128(10, 3))
+schema = na.c_schema(pa.decimal128(10, 3))
+schema
 ```
 
-You can extract the fields of a `Schema` object one at a time or parse it into a view to extract deserialized parameters.
+
+
+
+    <nanoarrow.c_lib.CSchema decimal128(10, 3)>
+    - format: 'd:10,3'
+    - name: ''
+    - flags: 2
+    - metadata: NULL
+    - dictionary: NULL
+    - children[0]:
+
+
+
+You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters.
 
 
 ```python
-print(schema.format)
-print(schema.view().decimal_precision)
-print(schema.view().decimal_scale)
+na.c_schema_view(schema)
 ```
 
-    d:10,3
-    10
-    3
 
 
-The `nanoarrow.schema()` helper is currently only implemented for pyarrow objects. If your data type has an `_export_to_c()`-like function, you can get the address of a freshly-allocated `ArrowSchema` as well:
+
+    <nanoarrow.c_lib.CSchemaView>
+    - type: 'decimal128'
+    - storage_type: 'decimal128'
+    - decimal_bitwidth: 128
+    - decimal_precision: 10
+    - decimal_scale: 3
+
+
+
+Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function.
 
 
 ```python
-schema = na.Schema.allocate()
+schema = na.c_schema()

Review Comment:
   But seeing the version below for Array, I admit that it is a little inconvenient there that you need to pass an allocated schema to the Array allocation (although this could also be done for the user automatically?)



