jorisvandenbossche commented on code in PR #340: URL: https://github.com/apache/arrow-nanoarrow/pull/340#discussion_r1447398459
########## .isort.cfg: ########## @@ -0,0 +1,23 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +[settings] Review Comment: We can't put this in the pyproject.toml because that's not top-level? ########## python/README.md: ########## @@ -43,97 +43,129 @@ If you can import the namespace, you're good to go! import nanoarrow as na ``` -## Example +## Low-level C library bindings -The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. All three can be wrapped by Python objects using the nanoarrow Python package. +The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. ### Schemas -Use `nanoarrow.schema()` to convert a data type-like object to an `ArrowSchema`. This is currently only implemented for pyarrow objects. 
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`). ```python import pyarrow as pa -schema = na.schema(pa.decimal128(10, 3)) +schema = na.c_schema(pa.decimal128(10, 3)) +schema ``` -You can extract the fields of a `Schema` object one at a time or parse it into a view to extract deserialized parameters. + + + + <nanoarrow.c_lib.CSchema decimal128(10, 3)> + - format: 'd:10,3' + - name: '' + - flags: 2 + - metadata: NULL + - dictionary: NULL + - children[0]: + + + +You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters. ```python -print(schema.format) -print(schema.view().decimal_precision) -print(schema.view().decimal_scale) +na.c_schema_view(schema) ``` - d:10,3 - 10 - 3 -The `nanoarrow.schema()` helper is currently only implemented for pyarrow objects. If your data type has an `_export_to_c()`-like function, you can get the address of a freshly-allocated `ArrowSchema` as well: + + <nanoarrow.c_lib.CSchemaView> + - type: 'decimal128' + - storage_type: 'decimal128' + - decimal_bitwidth: 128 + - decimal_precision: 10 + - decimal_scale: 3 + + + +Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function. ```python -schema = na.Schema.allocate() +schema = na.c_schema() Review Comment: Personally I find the previous way more explicit .. 
########## python/src/nanoarrow/_lib.pyx: ########## @@ -76,116 +81,96 @@ cdef void pycapsule_array_deleter(object array_capsule) noexcept: if array.release != NULL: ArrowArrayRelease(array) - free(array) + ArrowFree(array) cdef object alloc_c_array(ArrowArray** c_array) noexcept: - c_array[0] = <ArrowArray*> malloc(sizeof(ArrowArray)) + c_array[0] = <ArrowArray*> ArrowMalloc(sizeof(ArrowArray)) # Ensure the capsule destructor doesn't call a random release pointer c_array[0].release = NULL return PyCapsule_New(c_array[0], 'arrow_array', &pycapsule_array_deleter) -cdef void pycapsule_stream_deleter(object stream_capsule) noexcept: +cdef void pycapsule_array_stream_deleter(object stream_capsule) noexcept: cdef ArrowArrayStream* stream = <ArrowArrayStream*>PyCapsule_GetPointer( stream_capsule, 'arrow_array_stream' ) # Do not invoke the deleter on a used/moved capsule if stream.release != NULL: ArrowArrayStreamRelease(stream) - free(stream) + ArrowFree(stream) -cdef object alloc_c_stream(ArrowArrayStream** c_stream) noexcept: - c_stream[0] = <ArrowArrayStream*> malloc(sizeof(ArrowArrayStream)) +cdef object alloc_c_array_stream(ArrowArrayStream** c_stream) noexcept: + c_stream[0] = <ArrowArrayStream*> ArrowMalloc(sizeof(ArrowArrayStream)) # Ensure the capsule destructor doesn't call a random release pointer c_stream[0].release = NULL - return PyCapsule_New(c_stream[0], 'arrow_array_stream', &pycapsule_stream_deleter) - - -cdef void arrow_array_release(ArrowArray* array) noexcept with gil: - Py_DECREF(<object>array.private_data) - array.private_data = NULL - array.release = NULL - - -cdef class SchemaHolder: - """Memory holder for an ArrowSchema - - This class is responsible for the lifecycle of the ArrowSchema - whose memory it is responsible for. When this object is deleted, - a non-NULL release callback is invoked. 
- """ - cdef ArrowSchema c_schema - - def __cinit__(self): - self.c_schema.release = NULL + return PyCapsule_New(c_stream[0], 'arrow_array_stream', &pycapsule_array_stream_deleter) - def __dealloc__(self): - if self.c_schema.release != NULL: - ArrowSchemaRelease(&self.c_schema) - def _addr(self): - return <uintptr_t>&self.c_schema +cdef void pycapsule_device_array_deleter(object device_array_capsule) noexcept: + cdef ArrowDeviceArray* device_array = <ArrowDeviceArray*>PyCapsule_GetPointer( + device_array_capsule, 'arrow_device_array' + ) + # Do not invoke the deleter on a used/moved capsule + if device_array.array.release != NULL: + device_array.array.release(&device_array.array) + ArrowFree(device_array) -cdef class ArrayHolder: - """Memory holder for an ArrowArray - This class is responsible for the lifecycle of the ArrowArray - whose memory it is responsible. When this object is deleted, - a non-NULL release callback is invoked. - """ - cdef ArrowArray c_array +cdef object alloc_c_device_array(ArrowDeviceArray** c_device_array) noexcept: + c_device_array[0] = <ArrowDeviceArray*> ArrowMalloc(sizeof(ArrowDeviceArray)) + # Ensure the capsule destructor doesn't call a random release pointer + c_device_array[0].array.release = NULL + return PyCapsule_New(c_device_array[0], 'arrow_device_array', &pycapsule_device_array_deleter) - def __cinit__(self): - self.c_array.release = NULL - def __dealloc__(self): - if self.c_array.release != NULL: - ArrowArrayRelease(&self.c_array) +cdef void pycapsule_array_view_deleter(object array_capsule) noexcept: + cdef ArrowArrayView* array_view = <ArrowArrayView*>PyCapsule_GetPointer( + array_capsule, 'nanoarrow_array_view' + ) - def _addr(self): - return <uintptr_t>&self.c_array + ArrowArrayViewReset(array_view) -cdef class ArrayStreamHolder: - """Memory holder for an ArrowArrayStream + ArrowFree(array_view) - This class is responsible for the lifecycle of the ArrowArrayStream - whose memory it is responsible. 
When this object is deleted, - a non-NULL release callback is invoked. - """ - cdef ArrowArrayStream c_array_stream - def __cinit__(self): - self.c_array_stream.release = NULL +cdef object alloc_c_array_view(ArrowArrayView** c_array_view) noexcept: + c_array_view[0] = <ArrowArrayView*> ArrowMalloc(sizeof(ArrowArrayView)) + ArrowArrayViewInitFromType(c_array_view[0], NANOARROW_TYPE_UNINITIALIZED) + return PyCapsule_New(c_array_view[0], 'nanoarrow_array_view', &pycapsule_array_view_deleter) - def __dealloc__(self): - if self.c_array_stream.release != NULL: - ArrowArrayStreamRelease(&self.c_array_stream) - def _addr(self): - return <uintptr_t>&self.c_array_stream +# To more safely implement export of an ArrowArray whose address may be Review Comment: FWIW you can also add this as a normal docstring to the function ########## python/src/nanoarrow/_lib.pyx: ########## @@ -352,57 +326,45 @@ cdef class Schema: def metadata(self): self._assert_valid() if self._ptr.metadata != NULL: - return SchemaMetadata(self, <uintptr_t>self._ptr.metadata) + return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata) else: return None @property - def children(self): + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): self._assert_valid() - return SchemaChildren(self) + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + + return CSchema(self._base, <uintptr_t>self._ptr.children[i]) + + @property + def children(self): + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Schema(self, <uintptr_t>self._ptr.dictionary) + return CSchema(self, <uintptr_t>self._ptr.dictionary) else: return None - def view(self): Review Comment: We could still keep this method for convenience? 
(so you don't have to pass your schema object to two different functions) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -352,57 +326,45 @@ cdef class Schema: def metadata(self): self._assert_valid() if self._ptr.metadata != NULL: - return SchemaMetadata(self, <uintptr_t>self._ptr.metadata) + return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata) else: return None @property - def children(self): + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): self._assert_valid() - return SchemaChildren(self) + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + + return CSchema(self._base, <uintptr_t>self._ptr.children[i]) + + @property + def children(self): + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Schema(self, <uintptr_t>self._ptr.dictionary) + return CSchema(self, <uintptr_t>self._ptr.dictionary) else: return None - def view(self): Review Comment: One reason this would be useful is because the SchemaView doesn't give you access to the children (right? that's maybe also something that could be changed). So if you want to have a view of a child of a schema, you need something like `na.c_schema_view(na.c_schema(schema_obj).child(0))`? 
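To make the convenience argument concrete, here is a minimal sketch using stand-in Python classes. `SchemaSketch`, `SchemaViewSketch`, and `schema_view_sketch()` are hypothetical names for illustration only, not the actual Cython wrappers in `_lib.pyx`. It only shows how a retained `.view()` method would let a child's view be reached by chaining instead of nesting two module-level functions:

```python
class SchemaViewSketch:
    """Hypothetical stand-in for CSchemaView: exposes parsed parameters only."""

    def __init__(self, schema):
        self.type = schema.type_name


class SchemaSketch:
    """Hypothetical stand-in for CSchema: holds raw fields and children."""

    def __init__(self, type_name, children=()):
        self.type_name = type_name
        self._children = list(children)

    @property
    def n_children(self):
        return len(self._children)

    def child(self, i):
        # Same bounds check as child() in the diff above
        if i < 0 or i >= self.n_children:
            raise IndexError(f"{i} out of range [0, {self.n_children})")
        return self._children[i]

    def view(self):
        # Convenience method: delegates to the view constructor
        return SchemaViewSketch(self)


def schema_view_sketch(schema):
    """Hypothetical stand-in for a module-level view constructor."""
    return SchemaViewSketch(schema)


struct_schema = SchemaSketch("struct", [SchemaSketch("int64"), SchemaSketch("list")])

# Function-only spelling: the child is wrapped by a second, separate call
nested = schema_view_sketch(struct_schema.child(1))

# With the method kept, the same view is reachable by chaining
chained = struct_schema.child(1).view()
```

Both spellings produce an equivalent view; the method form is just shorter to type when drilling into children.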
########## python/src/nanoarrow/_lib_utils.py: ########## @@ -74,17 +79,134 @@ def array_repr(array, indent=0): else: lines.append(f"{indent_str}- dictionary: NULL") - children = array.children - lines.append(f"{indent_str}- children[{len(children)}]:") - for child in children: + lines.append(f"{indent_str}- children[{array.n_children}]:") + for child in array.children: child_repr = array_repr(child, indent=indent + 4) lines.append(f"{indent_str} {repr(child.schema.name)}: {child_repr}") return "\n".join(lines) +def schema_view_repr(schema_view): + lines = [ + "<nanoarrow.c_lib.CSchemaView>", + f"- type: {repr(schema_view.type)}", + f"- storage_type: {repr(schema_view.storage_type)}", + ] + + for attr_name in sorted(dir(schema_view)): + if attr_name.startswith("_") or attr_name in ("type", "storage_type"): + continue + + attr_value = getattr(schema_view, attr_name) + if attr_value is None: + continue + + lines.append(f"- {attr_name}: {repr(attr_value)}") + + return "\n".join(lines) Review Comment: Do we want to show something about the children here? 
Because right now for example for a list type, the schema view repr is less informative than the main schema repr: ``` In [68]: schema Out[68]: a: int64 b: list<item: double> child 0, item: double In [69]: na.c_schema(schema).child(1) Out[69]: <nanoarrow.c_lib.CSchema list> - format: '+l' - name: 'b' - flags: 2 - metadata: NULL - dictionary: NULL - children[1]: 'item': <nanoarrow.c_lib.CSchema double> - format: 'g' - name: 'item' - flags: 2 - metadata: NULL - dictionary: NULL - children[0]: In [70]: na.c_schema_view(na.c_schema(schema).child(1)) Out[70]: <nanoarrow.c_lib.CSchemaView> - type: 'list' - storage_type: 'list' ``` So the schema view repr doesn't say what type of list it is (just "list") ########## python/src/nanoarrow/_lib.pyx: ########## @@ -890,50 +790,129 @@ cdef class BufferView: self._element_size_bits = element_size_bits self._strides = self._item_size() self._shape = self._ptr.size_bytes // self._strides + self._format[0] = 0 + self._populate_format() + + def _addr(self): + return <uintptr_t>self._ptr.data.data + @property + def device_type(self): + return self._device.device_type + + @property + def device_id(self): + return self._device.device_id + + @property + def element_size_bits(self): + return self._element_size_bits + + @property + def size_bytes(self): + return self._ptr.size_bytes + + @property + def type(self): + if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY: + return "validity" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID: + return "type_id" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET: + return "union_offset" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET: + return "data_offset" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA: + return "data" + + @property + def data_type(self): + return ArrowTypeString(self._buffer_data_type).decode("UTF-8") + + @property + def format(self): + return self._format.decode("UTF-8") + + @property + def item_size(self): + return self._strides 
+ + def __len__(self): + return self._shape + + def __getitem__(self, int64_t i): + if i < 0 or i >= self._shape: + raise IndexError(f"Index {i} out of range") + cdef int64_t offset = self._strides * i + value = unpack_from(self.format, buffer=self, offset=offset) + if len(value) == 1: + return value[0] + else: + return value + + def __iter__(self): + for value in iter_unpack(self.format, self): + if len(value) == 1: + yield value[0] + else: + yield value Review Comment: Hmm, it seems that this doesn't work with the endianness "=" you added below to the format type of the buffer protocol ########## python/src/nanoarrow/_lib.pyx: ########## @@ -352,57 +326,45 @@ cdef class Schema: def metadata(self): self._assert_valid() if self._ptr.metadata != NULL: - return SchemaMetadata(self, <uintptr_t>self._ptr.metadata) + return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata) else: return None @property - def children(self): + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): self._assert_valid() - return SchemaChildren(self) + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + + return CSchema(self._base, <uintptr_t>self._ptr.children[i]) + + @property + def children(self): + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Schema(self, <uintptr_t>self._ptr.dictionary) + return CSchema(self, <uintptr_t>self._ptr.dictionary) else: return None - def view(self): - self._assert_valid() - schema_view = SchemaView() - cdef Error error = Error() - cdef int result = ArrowSchemaViewInit(&schema_view._schema_view, self._ptr, &error.c_error) - if result != NANOARROW_OK: - error.raise_message("ArrowSchemaViewInit()", result) - return schema_view +cdef class CSchemaView: + """Low-level ArrowSchemaView wrapper + This object is a literal wrapper around a read-only 
ArrowSchema. It provides field accessors + that return Python objects and handles structure lifecycle. -cdef class SchemaView: - """ArrowSchemaView wrapper - - The ArrowSchemaView is a nanoarrow C library structure that facilitates - access to the deserialized content of an ArrowSchema (e.g., parameter - values for parameterized types). This wrapper extends that facility to Python. - - Examples - -------- - - >>> import pyarrow as pa - >>> import nanoarrow as na - >>> schema = na.schema(pa.decimal128(10, 3)) - >>> schema_view = schema.view() - >>> schema_view.type - 'decimal128' - >>> schema_view.decimal_bitwidth - 128 - >>> schema_view.decimal_precision - 10 - >>> schema_view.decimal_scale - 3 + See `nanoarrow.c_schema_view()` for construction and usage examples. """ + cdef object _base Review Comment: ```suggestion cdef CSchema _base ``` ? (and if that is correct, could maybe also use more explicit name) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -352,57 +326,45 @@ cdef class Schema: def metadata(self): self._assert_valid() if self._ptr.metadata != NULL: - return SchemaMetadata(self, <uintptr_t>self._ptr.metadata) + return SchemaMetadata(self._base, <uintptr_t>self._ptr.metadata) else: return None @property - def children(self): + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): self._assert_valid() - return SchemaChildren(self) + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + + return CSchema(self._base, <uintptr_t>self._ptr.children[i]) + + @property + def children(self): + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Schema(self, <uintptr_t>self._ptr.dictionary) + return CSchema(self, <uintptr_t>self._ptr.dictionary) else: return None - def view(self): - self._assert_valid() - schema_view = SchemaView() - cdef Error error = 
Error() - cdef int result = ArrowSchemaViewInit(&schema_view._schema_view, self._ptr, &error.c_error) - if result != NANOARROW_OK: - error.raise_message("ArrowSchemaViewInit()", result) - return schema_view +cdef class CSchemaView: + """Low-level ArrowSchemaView wrapper + This object is a literal wrapper around a read-only ArrowSchema. It provides field accessors + that return Python objects and handles structure lifecycle. -cdef class SchemaView: - """ArrowSchemaView wrapper - - The ArrowSchemaView is a nanoarrow C library structure that facilitates - access to the deserialized content of an ArrowSchema (e.g., parameter Review Comment: I would keep this content in the new docstring, as it's still useful to explain the difference between CSchema and CSchemaView (for a user of the Python library who is not familiar with the nanoarrow C details) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -804,64 +761,6 @@ cdef class SchemaMetadata: yield key_obj, value_obj -cdef class ArrayChildren: Review Comment: Nice to see those Children classes removed! 
;) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -630,65 +574,71 @@ cdef class Array: @property def null_count(self): + self._assert_valid() return self._ptr.null_count + @property + def n_buffers(self): + self._assert_valid() + return self._ptr.n_buffers + @property def buffers(self): + self._assert_valid() return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers)) + @property + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): + self._assert_valid() + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + return CArray(self._base, <uintptr_t>self._ptr.children[i], self._schema.child(i)) + @property def children(self): - return ArrayChildren(self) + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Array(self, <uintptr_t>self._ptr.dictionary, self._schema.dictionary) + return CArray(self, <uintptr_t>self._ptr.dictionary, self._schema.dictionary) else: return None def __repr__(self): - return array_repr(self) - - -cdef class ArrayView: - """ArrowArrayView wrapper - - The ArrowArrayView is a nanoarrow C library structure that provides - structured access to buffers addresses, buffer sizes, and buffer - data types. The buffer data is usually propagated from an ArrowArray - but can also be propagated from other types of objects (e.g., serialized - IPC). The offset and length of this view are independent of its parent - (i.e., this object can also represent a slice of its parent). 
Review Comment: Same comment here about the docstring ########## python/src/nanoarrow/_lib.pyx: ########## @@ -233,39 +218,23 @@ cdef class Error: raise NanoarrowException(what, code, "") -cdef class Schema: - """ArrowSchema wrapper - - This class provides a user-facing interface to access the fields of - an ArrowSchema as defined in the Arrow C Data interface. These objects - are usually created using `nanoarrow.schema()`. This Python wrapper - allows access to schema fields but does not automatically deserialize - their content: use `.view()` to validate and deserialize the content - into a more easily inspectable object. - - Examples - -------- - - >>> import pyarrow as pa - >>> import nanoarrow as na - >>> schema = na.schema(pa.int32()) - >>> schema.is_valid() - True - >>> schema.format - 'i' - >>> schema.name - '' - >>> schema_view = schema.view() - >>> schema_view.type - 'int32' +cdef class CSchema: + """Low-level ArrowSchema wrapper + + This object is a literal wrapper around a read-only ArrowSchema. It provides field accessors + that return Python objects and handles the C Data interface lifecycle (i.e., initialized + ArrowSchema structures are always released). + + See `nanoarrow.c_schema()` for construction and usage examples. """ cdef object _base Review Comment: This `_base` is now always a capsule? (if so, maybe add a comment saying that) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -24,45 +24,50 @@ This Cython extension provides low-level Python wrappers around the Arrow C Data and Arrow C Stream interface structs. In general, there is one wrapper per C struct and pointer validity is managed by keeping strong references to Python objects. These wrappers are intended to -be literal and stay close to the structure definitions. 
+be literal and stay close to the structure definitions: higher level +interfaces can and should be built in Python where it is faster to +iterate and where it is easier to create a better user experience +by default (i.e., classes, methods, and functions implemented in Python +generally have better autocomplete + documentation available to IDEs). """ from libc.stdint cimport uintptr_t, int64_t -from libc.stdlib cimport malloc, free from libc.string cimport memcpy -from cpython.mem cimport PyMem_Malloc, PyMem_Free +from libc.stdio cimport snprintf from cpython.bytes cimport PyBytes_FromStringAndSize -from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer, PyCapsule_CheckExact +from cpython.pycapsule cimport PyCapsule_New, PyCapsule_GetPointer from cpython cimport Py_buffer -from cpython.ref cimport PyObject, Py_INCREF, Py_DECREF +from cpython.ref cimport Py_INCREF, Py_DECREF from nanoarrow_c cimport * from nanoarrow_device_c cimport * -from nanoarrow._lib_utils import array_repr, device_array_repr, schema_repr, device_repr +from struct import unpack_from, iter_unpack +from nanoarrow import _lib_utils def c_version(): """Return the nanoarrow C library version string """ return ArrowNanoarrowVersion().decode("UTF-8") +# PyCapsule utilities # -# PyCapsule export utilities -# - - +# PyCapsules are used (1) to safely manage memory associated with C structures +# by initializing them and ensuring the appropriate cleanup is invoked when +# the object is deleted; and (2) as an export mechanism conforming to the +# Arrow PyCapsule interface for the objects where this is defined. cdef void pycapsule_schema_deleter(object schema_capsule) noexcept: cdef ArrowSchema* schema = <ArrowSchema*>PyCapsule_GetPointer( schema_capsule, 'arrow_schema' ) if schema.release != NULL: ArrowSchemaRelease(schema) - free(schema) + ArrowFree(schema) Review Comment: For my education: is there a benefit in using the nanoarrow version? 
########## python/src/nanoarrow/_lib.pyx: ########## @@ -630,65 +574,71 @@ cdef class Array: @property def null_count(self): + self._assert_valid() return self._ptr.null_count + @property + def n_buffers(self): + self._assert_valid() + return self._ptr.n_buffers + @property def buffers(self): + self._assert_valid() return tuple(<uintptr_t>self._ptr.buffers[i] for i in range(self._ptr.n_buffers)) + @property + def n_children(self): + self._assert_valid() + return self._ptr.n_children + + def child(self, int64_t i): + self._assert_valid() + if i < 0 or i >= self._ptr.n_children: + raise IndexError(f"{i} out of range [0, {self._ptr.n_children})") + return CArray(self._base, <uintptr_t>self._ptr.children[i], self._schema.child(i)) + @property def children(self): - return ArrayChildren(self) + for i in range(self.n_children): + yield self.child(i) @property def dictionary(self): self._assert_valid() if self._ptr.dictionary != NULL: - return Array(self, <uintptr_t>self._ptr.dictionary, self._schema.dictionary) + return CArray(self, <uintptr_t>self._ptr.dictionary, self._schema.dictionary) else: return None def __repr__(self): - return array_repr(self) - - -cdef class ArrayView: - """ArrowArrayView wrapper - - The ArrowArrayView is a nanoarrow C library structure that provides - structured access to buffers addresses, buffer sizes, and buffer - data types. The buffer data is usually propagated from an ArrowArray - but can also be propagated from other types of objects (e.g., serialized - IPC). The offset and length of this view are independent of its parent - (i.e., this object can also represent a slice of its parent). 
- - Examples - -------- - - >>> import pyarrow as pa - >>> import numpy as np - >>> import nanoarrow as na - >>> array = na.array(pa.array(["one", "two", "three", None])) - >>> array_view = na.array_view(array) - >>> np.array(array_view.buffers[1]) - array([ 0, 3, 6, 11, 11], dtype=int32) - >>> np.array(array_view.buffers[2]) - array([b'o', b'n', b'e', b't', b'w', b'o', b't', b'h', b'r', b'e', b'e'], - dtype='|S1') + return _lib_utils.array_repr(self) + + +cdef class CArrayView: + """Low-level ArrowArrayView wrapper + + This object is a literal wrapper around an ArrowArrayView. It provides field accessors + that return Python objects and handles the structure lifecycle (i.e., initialized + ArrowArrayView structures are always released). + + See `nanoarrow.c_array_view()` for construction and usage examples. """ cdef object _base cdef ArrowArrayView* _ptr cdef ArrowDevice* _device - cdef Schema _schema - cdef object _base_buffer - def __cinit__(self, object base, uintptr_t addr, Schema schema, object base_buffer): + def __cinit__(self, object base, uintptr_t addr): self._base = base self._ptr = <ArrowArrayView*>addr - self._schema = schema - self._base_buffer = base_buffer self._device = ArrowDeviceCpu() + @property + def storage_type(self): Review Comment: I see that `storage_type` already existed in the SchemaView before, but what is exactly the difference with `type`? ########## python/src/nanoarrow/_lib.pyx: ########## @@ -947,88 +926,28 @@ cdef class BufferView: def __releasebuffer__(self, Py_buffer *buffer): pass + def __repr__(self): + return _lib_utils.buffer_view_repr(self) Review Comment: It might be nice to include a name here as well for the standalone repr (the util function only gives you the content, which is useful for including it into another repr). 
Something like ```suggestion return f"<nanoarrow.c_lib.BufferView {_lib_utils.buffer_view_repr(self)[1:]}" ``` (the slicing is because the util output already starts with a `<`; that could also be changed in the util function) ########## python/src/nanoarrow/_lib.pyx: ########## @@ -890,50 +790,129 @@ cdef class BufferView: self._element_size_bits = element_size_bits self._strides = self._item_size() self._shape = self._ptr.size_bytes // self._strides + self._format[0] = 0 + self._populate_format() + + def _addr(self): + return <uintptr_t>self._ptr.data.data + @property + def device_type(self): + return self._device.device_type + + @property + def device_id(self): + return self._device.device_id + + @property + def element_size_bits(self): + return self._element_size_bits + + @property + def size_bytes(self): + return self._ptr.size_bytes + + @property + def type(self): + if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY: + return "validity" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID: + return "type_id" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET: + return "union_offset" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET: + return "data_offset" + elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA: + return "data" + + @property + def data_type(self): + return ArrowTypeString(self._buffer_data_type).decode("UTF-8") + + @property + def format(self): + return self._format.decode("UTF-8") + + @property + def item_size(self): + return self._strides + + def __len__(self): + return self._shape + + def __getitem__(self, int64_t i): + if i < 0 or i >= self._shape: + raise IndexError(f"Index {i} out of range") + cdef int64_t offset = self._strides * i + value = unpack_from(self.format, buffer=self, offset=offset) + if len(value) == 1: + return value[0] + else: + return value + + def __iter__(self): + for value in iter_unpack(self.format, self): + if len(value) == 1: + yield value[0] + else: + yield value Review Comment: A Python 
memoryview object supports this kind of indexing, and a conversion to a Python list as well (https://docs.python.org/3/library/stdtypes.html#memoryview.tolist). So a potential alternative is to reuse that (`memoryview(self).tolist()` might work out of the box) ########## .isort.cfg: ########## @@ -0,0 +1,23 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+
+[settings]

Review Comment:
   (btw, for another PR: I would also switch to ruff for linting, since it also includes the functionality of isort)



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -890,50 +790,129 @@ cdef class BufferView:
         self._element_size_bits = element_size_bits
         self._strides = self._item_size()
         self._shape = self._ptr.size_bytes // self._strides
+        self._format[0] = 0
+        self._populate_format()
+
+    def _addr(self):
+        return <uintptr_t>self._ptr.data.data
 
+    @property
+    def device_type(self):
+        return self._device.device_type
+
+    @property
+    def device_id(self):
+        return self._device.device_id
+
+    @property
+    def element_size_bits(self):
+        return self._element_size_bits
+
+    @property
+    def size_bytes(self):
+        return self._ptr.size_bytes
+
+    @property
+    def type(self):
+        if self._buffer_type == NANOARROW_BUFFER_TYPE_VALIDITY:
+            return "validity"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_TYPE_ID:
+            return "type_id"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_UNION_OFFSET:
+            return "union_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA_OFFSET:
+            return "data_offset"
+        elif self._buffer_type == NANOARROW_BUFFER_TYPE_DATA:
+            return "data"
+
+    @property
+    def data_type(self):
+        return ArrowTypeString(self._buffer_data_type).decode("UTF-8")
+
+    @property
+    def format(self):
+        return self._format.decode("UTF-8")
+
+    @property
+    def item_size(self):
+        return self._strides
+
+    def __len__(self):
+        return self._shape
+
+    def __getitem__(self, int64_t i):
+        if i < 0 or i >= self._shape:
+            raise IndexError(f"Index {i} out of range")
+        cdef int64_t offset = self._strides * i
+        value = unpack_from(self.format, buffer=self, offset=offset)
+        if len(value) == 1:
+            return value[0]
+        else:
+            return value
+
+    def __iter__(self):
+        for value in iter_unpack(self.format, self):
+            if len(value) == 1:
+                yield value[0]
+            else:
+                yield value
 
     cdef Py_ssize_t _item_size(self):
-        if self._buffer_data_type == NANOARROW_TYPE_BOOL:
-            return 1
-        elif self._buffer_data_type == NANOARROW_TYPE_STRING:
-            return 1
-        elif self._buffer_data_type == NANOARROW_TYPE_BINARY:
+        if self._element_size_bits < 8:
             return 1
         else:
             return self._element_size_bits // 8
 
-    cdef const char* _get_format(self):
-        if self._buffer_data_type == NANOARROW_TYPE_INT8:
-            return "b"
+    cdef void _populate_format(self):
+        cdef const char* format_const = NULL
+        if self._element_size_bits == 0:
+            # Variable-size elements (e.g., data buffer for string or binary) export as
+            # one byte per element (character if string, unspecified binary otherwise)
+            if self._buffer_data_type == NANOARROW_TYPE_STRING:
+                format_const = "c"
+            else:
+                format_const = "B"
+        elif self._element_size_bits < 8:
+            # Bitmaps export as unspecified binary
+            format_const = "B"
+        elif self._buffer_data_type == NANOARROW_TYPE_INT8:
+            format_const = "b"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT8:
-            return "B"
+            format_const = "B"
         elif self._buffer_data_type == NANOARROW_TYPE_INT16:
-            return "h"
+            format_const = "=h"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT16:
-            return "H"
+            format_const = "=H"
         elif self._buffer_data_type == NANOARROW_TYPE_INT32:
-            return "i"
+            format_const = "=i"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT32:
-            return "I"
+            format_const = "=I"
         elif self._buffer_data_type == NANOARROW_TYPE_INT64:
-            return "l"
+            format_const = "=q"
         elif self._buffer_data_type == NANOARROW_TYPE_UINT64:
-            return "L"
+            format_const = "=Q"
+        elif self._buffer_data_type == NANOARROW_TYPE_HALF_FLOAT:
+            format_const = "=e"
         elif self._buffer_data_type == NANOARROW_TYPE_FLOAT:
-            return "f"
+            format_const = "=f"
         elif self._buffer_data_type == NANOARROW_TYPE_DOUBLE:
-            return "d"
-        elif self._buffer_data_type == NANOARROW_TYPE_STRING:
-            return "c"
+            format_const = "=d"
+        elif self._buffer_data_type == NANOARROW_TYPE_INTERVAL_DAY_TIME:
+            format_const = "=ii"
+        elif self._buffer_data_type == NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO:
+            format_const = "=iiq"
+
+        if format_const != NULL:
+            snprintf(self._format, sizeof(self._format), "%s", format_const)
         else:
-            return "B"
+            snprintf(self._format, sizeof(self._format), "%ds", self._element_size_bits // 8)

Review Comment:
   Why is this needed (compared to just returning the string as was done before)?



##########
python/README.md:
##########
@@ -43,97 +43,129 @@ If you can import the namespace, you're good to go!
 import nanoarrow as na
 ```
 
-## Example
+## Low-level C library bindings
 
-The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. All three can be wrapped by Python objects using the nanoarrow Python package.
+The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.
 
 ### Schemas
 
-Use `nanoarrow.schema()` to convert a data type-like object to an `ArrowSchema`. This is currently only implemented for pyarrow objects.
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
 
 ```python
 import pyarrow as pa
-schema = na.schema(pa.decimal128(10, 3))
+schema = na.c_schema(pa.decimal128(10, 3))
+schema
 ```
 
-You can extract the fields of a `Schema` object one at a time or parse it into a view to extract deserialized parameters.
+
+
+
+    <nanoarrow.c_lib.CSchema decimal128(10, 3)>
+    - format: 'd:10,3'
+    - name: ''
+    - flags: 2
+    - metadata: NULL
+    - dictionary: NULL
+    - children[0]:
+
+
+
+You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters.
 
 ```python
-print(schema.format)
-print(schema.view().decimal_precision)
-print(schema.view().decimal_scale)
+na.c_schema_view(schema)
 ```
 
-    d:10,3
-    10
-    3
 
-The `nanoarrow.schema()` helper is currently only implemented for pyarrow objects. If your data type has an `_export_to_c()`-like function, you can get the address of a freshly-allocated `ArrowSchema` as well:
+
+    <nanoarrow.c_lib.CSchemaView>
+    - type: 'decimal128'
+    - storage_type: 'decimal128'
+    - decimal_bitwidth: 128
+    - decimal_precision: 10
+    - decimal_scale: 3
+
+
+
+Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function.
 
 ```python
-schema = na.Schema.allocate()
+schema = na.c_schema()

Review Comment:
   But seeing the version below for Array, I admit that it is a little inconvenient there that you need to pass an allocated schema to the Array allocation (although this could also be done for the user automatically?)
