This is an automated email from the ASF dual-hosted git repository.
paleolimbot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-nanoarrow.git
The following commit(s) were added to refs/heads/main by this push:
new fcc540a8 docs(python): Update Python bindings readme (#474)
fcc540a8 is described below
commit fcc540a8fabe03a38f07e010f8a72c733e18a4a8
Author: Dewey Dunnington <[email protected]>
AuthorDate: Fri May 17 16:22:00 2024 -0300
docs(python): Update Python bindings readme (#474)
The previous readme was written for the previous release and is
outdated!
---
python/README.ipynb | 480 +++++++++++++++++++++++++++++++++++++++++-----------
python/README.md | 318 ++++++++++++++++++++++++++--------
2 files changed, 624 insertions(+), 174 deletions(-)
diff --git a/python/README.ipynb b/python/README.ipynb
index 0f13829a..5d62065b 100644
--- a/python/README.ipynb
+++ b/python/README.ipynb
@@ -36,11 +36,19 @@
"\n",
"## Installation\n",
"\n",
- "Python bindings for nanoarrow are not yet available on PyPI. You can install via\n",
- "URL (requires a C compiler):\n",
+ "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
+ "[conda-forge](https://conda-forge.org/):\n",
"\n",
- "```bash\n",
- "python -m pip install \"git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python\"\n",
+ "```shell\n",
+ "pip install nanoarrow\n",
+ "conda install nanoarrow -c conda-forge\n",
+ "```\n",
+ "\n",
+ "Development versions (based on the `main` branch) are also available:\n",
+ "\n",
+ "```shell\n",
+ "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
+ " --prefer-binary --pre nanoarrow\n",
"```\n",
"\n",
"If you can import the namespace, you're good to go!"
@@ -48,7 +56,7 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
@@ -56,102 +64,326 @@
]
},
{
- "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Low-level C library bindings\n",
- "\n",
- "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.\n",
+ "## Data types, arrays, and array streams\n",
"\n",
- "### Schemas\n",
- "\n",
- "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
+ "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
]
},
{
"cell_type": "code",
- "execution_count": 2,
+ "execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CSchema decimal128(10, 3)>\n",
- "- format: 'd:10,3'\n",
- "- name: ''\n",
- "- flags: 2\n",
- "- metadata: NULL\n",
- "- dictionary: NULL\n",
- "- children[0]:"
+ "<Schema> int32"
]
},
- "execution_count": 2,
+ "execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "import pyarrow as pa\n",
- "schema = na.c_schema(pa.decimal128(10, 3))\n",
- "schema"
+ "na.int32()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "nanoarrow.Array<int32>[3]\n",
+ "1\n",
+ "2\n",
+ "3"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "na.Array([1, 2, 3], na.int32())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "nanoarrow.Array<int32>[6]\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "4\n",
+ "5\n",
+ "6"
+ ]
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
+ "chunked"
]
},
{
- "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can extract the fields of a `CSchema` object one at a time or parse
it into a view to extract deserialized parameters."
+ "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "nanoarrow.ArrayStream<int32>"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stream = na.ArrayStream(chunked)\n",
+ "stream"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "nanoarrow.Array<int32>[3]\n",
+ "1\n",
+ "2\n",
+ "3\n",
+ "nanoarrow.Array<int32>[3]\n",
+ "4\n",
+ "5\n",
+ "6\n"
+ ]
+ }
+ ],
+ "source": [
+ "with stream:\n",
+ "    for chunk in stream:\n",
+ "        print(chunk)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
+ "na.ArrayStream.from_url(url)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "pyarrow.Field<: int32>"
+ ]
+ },
+ "execution_count": 53,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import pyarrow as pa\n",
+ "\n",
+ "pa.field(na.int32())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
+ "[\n",
+ " [\n",
+ " 1,\n",
+ " 2,\n",
+ " 3\n",
+ " ],\n",
+ " [\n",
+ " 4,\n",
+ " 5,\n",
+ " 6\n",
+ " ]\n",
+ "]"
+ ]
+ },
+ "execution_count": 54,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pa.chunked_array(chunked)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<pyarrow.lib.Int32Array object at 0x11b552500>\n",
+ "[\n",
+ " 4,\n",
+ " 5,\n",
+ " 6\n",
+ "]"
+ ]
+ },
+ "execution_count": 55,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pa.array(chunked.chunk(1))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "nanoarrow.Array<int64>[3]\n",
+ "10\n",
+ "11\n",
+ "12"
+ ]
+ },
+ "execution_count": 56,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "na.Array(pa.array([10, 11, 12]))"
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CSchemaView>\n",
- "- type: 'decimal128'\n",
- "- storage_type: 'decimal128'\n",
- "- decimal_bitwidth: 128\n",
- "- decimal_precision: 10\n",
- "- decimal_scale: 3\n",
- "- dictionary_ordered: False\n",
- "- map_keys_sorted: False\n",
- "- nullable: True\n",
- "- storage_type_id: 24\n",
- "- type_id: 24"
+ "<Schema> string"
]
},
- "execution_count": 3,
+ "execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "na.c_schema_view(schema)"
+ "na.Schema(pa.string())"
]
},
{
+ "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function."
+ "## Low-level C library bindings\n",
+ "\n",
+ "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
+ "\n",
+ "### Schemas\n",
+ "\n",
+ "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
]
},
{
"cell_type": "code",
- "execution_count": 4,
+ "execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CSchema int32>\n",
- "- format: 'i'\n",
+ "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
+ "- format: 'd:10,3'\n",
"- name: ''\n",
"- flags: 2\n",
"- metadata: NULL\n",
@@ -159,15 +391,41 @@
"- children[0]:"
]
},
- "execution_count": 4,
+ "execution_count": 58,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "na.c_schema(pa.decimal128(10, 3))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(10, 3)"
+ ]
+ },
+ "execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "schema = na.allocate_c_schema()\n",
- "pa.int32()._export_to_c(schema._addr())\n",
- "schema"
+ "schema = na.Schema(pa.decimal128(10, 3))\n",
+ "schema.precision, schema.scale"
]
},
{
@@ -190,29 +448,28 @@
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CArray string>\n",
+ "<nanoarrow.c_array.CArray string>\n",
"- length: 4\n",
"- offset: 0\n",
"- null_count: 1\n",
- "- buffers: (3678035706048, 3678035705984, 3678035706112)\n",
+ "- buffers: (4754305168, 4754307808, 4754310464)\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
- "execution_count": 5,
+ "execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "array = na.c_array(pa.array([\"one\", \"two\", \"three\", None]))\n",
- "array"
+ "na.c_array([\"one\", \"two\", \"three\", None], na.string())"
]
},
{
@@ -220,67 +477,87 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:"
+ "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
]
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CArrayView>\n",
- "- storage_type: 'string'\n",
- "- length: 4\n",
- "- offset: 0\n",
- "- null_count: 1\n",
- "- buffers[3]:\n",
- " - validity <bool[1 b] 11100000>\n",
- " - data_offset <int32[20 b] 0 3 6 11 11>\n",
- " - data <string[11 b] b'onetwothree'>\n",
- "- dictionary: NULL\n",
- "- children[0]:"
+ "['one', 'two', 'three', None]"
]
},
- "execution_count": 6,
+ "execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "na.c_array_view(array)"
+ "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
+ "array.to_pylist()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
+ " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
+ " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
+ ]
+ },
+ "execution_count": 62,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "array.buffers"
]
},
{
- "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions."
+ "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "3"
+ "<nanoarrow.c_array.CArray string>\n",
+ "- length: 2\n",
+ "- offset: 0\n",
+ "- null_count: 0\n",
+ "- buffers: (0, 5002908320, 4999694624)\n",
+ "- dictionary: NULL\n",
+ "- children[0]:"
]
},
- "execution_count": 7,
+ "execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "array = na.allocate_c_array()\n",
- "pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n",
- "array.length"
+ "na.c_array_from_buffers(\n",
+ "    na.string(),\n",
+ "    2,\n",
+ "    [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
+ ")"
]
},
{
@@ -290,30 +567,29 @@
"source": [
"### Array streams\n",
"\n",
- "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`)."
+ "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CArrayStream>\n",
- "- get_schema(): struct<some_column: int32>"
+ "<nanoarrow.c_array_stream.CArrayStream>\n",
+ "- get_schema(): struct<col1: int64>"
]
},
- "execution_count": 8,
+ "execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "pa_array_child = pa.array([1, 2, 3], pa.int32())\n",
- "pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n",
- "reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n",
+ "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
+ "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
"array_stream = na.c_array_stream(reader)\n",
"array_stream"
]
@@ -328,25 +604,25 @@
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "<nanoarrow.c_lib.CArray struct<some_column: int32>>\n",
+ "<nanoarrow.c_array.CArray struct<col1: int64>>\n",
"- length: 3\n",
"- offset: 0\n",
"- null_count: 0\n",
"- buffers: (0,)\n",
"- dictionary: NULL\n",
"- children[1]:\n",
- " 'some_column': <nanoarrow.c_lib.CArray int32>\n",
+ " 'col1': <nanoarrow.c_array.CArray int64>\n",
" - length: 3\n",
" - offset: 0\n",
" - null_count: 0\n",
- " - buffers: (0, 3678035837056)\n",
+ " - buffers: (0, 2642948588352)\n",
" - dictionary: NULL\n",
" - children[0]:\n"
]
@@ -358,34 +634,34 @@
]
},
{
- "attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:"
+ "Use `ArrayStream()` for a higher level interface:"
]
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "<nanoarrow.c_lib.CArrayStream>\n",
- "- get_schema(): struct<some_column: int32>"
+ "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
+ "{'col1': 1}\n",
+ "{'col1': 2}\n",
+ "{'col1': 3}"
]
},
- "execution_count": 10,
+ "execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "array_stream = na.allocate_c_array_stream()\n",
- "reader._export_to_c(array_stream._addr())\n",
- "array_stream"
+ "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
+ "na.ArrayStream(reader).read_all()"
]
},
{
@@ -408,11 +684,13 @@
"\n",
"```shell\n",
"# Install dependencies\n",
- "pip install -e .[test]\n",
+ "pip install -e \".[test]\"\n",
"\n",
"# Run tests\n",
"pytest -vvx\n",
- "```"
+ "```\n",
+ "\n",
+ "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
]
}
],
@@ -432,7 +710,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.11.4"
+ "version": "3.12.3"
},
"orig_nbformat": 4
},
diff --git a/python/README.md b/python/README.md
index 42b4e390..f279a095 100644
--- a/python/README.md
+++ b/python/README.md
@@ -29,11 +29,19 @@ interfaces.
## Installation
-Python bindings for nanoarrow are not yet available on PyPI. You can install via
-URL (requires a C compiler):
+The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and
+[conda-forge](https://conda-forge.org/):
-```bash
-python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"
+```shell
+pip install nanoarrow
+conda install nanoarrow -c conda-forge
+```
+
+Development versions (based on the `main` branch) are also available:
+
+```shell
+pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
+ --prefer-binary --pre nanoarrow
```
If you can import the namespace, you're good to go!
@@ -43,72 +51,207 @@ If you can import the namespace, you're good to go!
import nanoarrow as na
```
-## Low-level C library bindings
+## Data types, arrays, and array streams
-The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.
+The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package.
+
+
+```python
+na.int32()
+```
+
+
+
+
+ <Schema> int32
-### Schemas
-Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
+
+
+```python
+na.Array([1, 2, 3], na.int32())
+```
+
+
+
+
+ nanoarrow.Array<int32>[3]
+ 1
+ 2
+ 3
+
+
+
+The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this.
+
+
+```python
+chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
+chunked
+```
+
+
+
+
+ nanoarrow.Array<int32>[6]
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+
+
+
+Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet.
+
+
+```python
+stream = na.ArrayStream(chunked)
+stream
+```
+
+
+
+
+ nanoarrow.ArrayStream<int32>
+
+
+
+
+```python
+with stream:
+    for chunk in stream:
+        print(chunk)
+```
+
+ nanoarrow.Array<int32>[3]
+ 1
+ 2
+ 3
+ nanoarrow.Array<int32>[3]
+ 4
+ 5
+ 6
+
+
+The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:
+
+
+```python
+url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
+na.ArrayStream.from_url(url)
+```
+
+
+
+
+ nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
+
+
+
+These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:
```python
import pyarrow as pa
-schema = na.c_schema(pa.decimal128(10, 3))
-schema
+
+pa.field(na.int32())
```
- <nanoarrow.c_lib.CSchema decimal128(10, 3)>
- - format: 'd:10,3'
- - name: ''
- - flags: 2
- - metadata: NULL
- - dictionary: NULL
- - children[0]:
+ pyarrow.Field<: int32>
+
+
+
+
+```python
+pa.chunked_array(chunked)
+```
+
+
+
+
+ <pyarrow.lib.ChunkedArray object at 0x12a49a250>
+ [
+ [
+ 1,
+ 2,
+ 3
+ ],
+ [
+ 4,
+ 5,
+ 6
+ ]
+ ]
+
+
+
+
+```python
+pa.array(chunked.chunk(1))
+```
+
+
+
+
+ <pyarrow.lib.Int32Array object at 0x11b552500>
+ [
+ 4,
+ 5,
+ 6
+ ]
+
+
+
+
+```python
+na.Array(pa.array([10, 11, 12]))
+```
+
+
+
+ nanoarrow.Array<int64>[3]
+ 10
+ 11
+ 12
-You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters.
```python
-na.c_schema_view(schema)
+na.Schema(pa.string())
```
- <nanoarrow.c_lib.CSchemaView>
- - type: 'decimal128'
- - storage_type: 'decimal128'
- - decimal_bitwidth: 128
- - decimal_precision: 10
- - decimal_scale: 3
- - dictionary_ordered: False
- - map_keys_sorted: False
- - nullable: True
- - storage_type_id: 24
- - type_id: 24
+ <Schema> string
+
+
+
+## Low-level C library bindings
+The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.
+### Schemas
-Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function.
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
```python
-schema = na.allocate_c_schema()
-pa.int32()._export_to_c(schema._addr())
-schema
+na.c_schema(pa.decimal128(10, 3))
```
- <nanoarrow.c_lib.CSchema int32>
- - format: 'i'
+ <nanoarrow.c_schema.CSchema decimal128(10, 3)>
+ - format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
@@ -117,6 +260,21 @@ schema
+Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:
+
+
+```python
+schema = na.Schema(pa.decimal128(10, 3))
+schema.precision, schema.scale
+```
+
+
+
+
+ (10, 3)
+
+
+
The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released.
### Arrays
@@ -125,72 +283,83 @@ You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowAr
```python
-array = na.c_array(pa.array(["one", "two", "three", None]))
-array
+na.c_array(["one", "two", "three", None], na.string())
```
- <nanoarrow.c_lib.CArray string>
+ <nanoarrow.c_array.CArray string>
- length: 4
- offset: 0
- null_count: 1
- - buffers: (3678035706048, 3678035705984, 3678035706112)
+ - buffers: (4754305168, 4754307808, 4754310464)
- dictionary: NULL
- children[0]:
-You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:
+Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:
```python
-na.c_array_view(array)
+array = na.Array(["one", "two", "three", None], na.string())
+array.to_pylist()
```
- <nanoarrow.c_lib.CArrayView>
- - storage_type: 'string'
- - length: 4
- - offset: 0
- - null_count: 1
- - buffers[3]:
- - validity <bool[1 b] 11100000>
- - data_offset <int32[20 b] 0 3 6 11 11>
- - data <string[11 b] b'onetwothree'>
- - dictionary: NULL
- - children[0]:
+ ['one', 'two', 'three', None]
+
+
+
+
+```python
+array.buffers
+```
+
+
+ (nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
+ nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
+ nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))
-Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions.
+
+
+Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:
```python
-array = na.allocate_c_array()
-pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
-array.length
+na.c_array_from_buffers(
+    na.string(),
+    2,
+    [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
+)
```
- 3
+ <nanoarrow.c_array.CArray string>
+ - length: 2
+ - offset: 0
+ - null_count: 0
+ - buffers: (0, 5002908320, 4999694624)
+ - dictionary: NULL
+ - children[0]:
### Array streams
-You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`).
+You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`).
```python
-pa_array_child = pa.array([1, 2, 3], pa.int32())
-pa_array = pa.record_batch([pa_array_child], names=["some_column"])
-reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
+pa_batch = pa.record_batch({"col1": [1, 2, 3]})
+reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
array_stream = na.c_array_stream(reader)
array_stream
```
@@ -198,8 +367,8 @@ array_stream
- <nanoarrow.c_lib.CArrayStream>
- - get_schema(): struct<some_column: int32>
+ <nanoarrow.c_array_stream.CArrayStream>
+ - get_schema(): struct<col1: int64>
@@ -211,36 +380,37 @@ for array in array_stream:
     print(array)
```
- <nanoarrow.c_lib.CArray struct<some_column: int32>>
+ <nanoarrow.c_array.CArray struct<col1: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
- 'some_column': <nanoarrow.c_lib.CArray int32>
+ 'col1': <nanoarrow.c_array.CArray int64>
- length: 3
- offset: 0
- null_count: 0
- - buffers: (0, 3678035837056)
+ - buffers: (0, 2642948588352)
- dictionary: NULL
- children[0]:
-You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:
+Use `ArrayStream()` for a higher level interface:
```python
-array_stream = na.allocate_c_array_stream()
-reader._export_to_c(array_stream._addr())
-array_stream
+reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
+na.ArrayStream(reader).read_all()
```
- <nanoarrow.c_lib.CArrayStream>
- - get_schema(): struct<some_column: int32>
+ nanoarrow.Array<non-nullable struct<col1: int64>>[3]
+ {'col1': 1}
+ {'col1': 2}
+ {'col1': 3}
@@ -264,3 +434,5 @@ pip install -e ".[test]"
# Run tests
pytest -vvx
```
+
+CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.
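
Note on the buffer output shown in the updated README: the three `CBufferView` values printed for the string array (`11100000`, `0 3 6 11 11`, `b'onetwothree'`) appear to follow the Arrow columnar layout for variable-length strings: a validity bitmap, an int32 offsets buffer, and a data buffer. The sketch below decodes that layout using only the Python standard library. It is an illustration under that assumption, not nanoarrow code: `decode_string_array` is a hypothetical helper name, and the buffers are built by hand rather than obtained from `na.Array`.

```python
import struct


def decode_string_array(validity, offsets, data, length):
    """Decode Arrow-style string buffers into a list of str or None.

    Hypothetical helper for illustration only; not part of nanoarrow.
    """
    # The offsets buffer holds length + 1 little-endian int32 values.
    offs = struct.unpack(f"<{length + 1}i", offsets[: 4 * (length + 1)])
    values = []
    for i in range(length):
        # Validity bits are stored least-significant-bit first.
        # A missing bitmap (None) means every element is valid.
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            values.append(data[offs[i]:offs[i + 1]].decode("utf-8"))
        else:
            values.append(None)
    return values


# Buffers matching the README output for ["one", "two", "three", None]
validity = bytes([0b00000111])  # printed element-by-element as 11100000
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)  # 5 values * 4 bytes = 20 bytes
data = b"onetwothree"

values = decode_string_array(validity, offsets, data, 4)
print(values)

# Buffers matching the c_array_from_buffers() example (no validity bitmap)
no_nulls = decode_string_array(None, struct.pack("<3i", 0, 3, 6), b"abcdef", 2)
print(no_nulls)
```

The second call mirrors the `c_array_from_buffers()` example, where `None` is passed in place of the validity buffer because the array contains no nulls.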