(arrow-nanoarrow) branch main updated: docs(python): Update Python bindings readme (#474)

paleolimbot Fri, 17 May 2024 12:22:10 -0700

This is an automated email from the ASF dual-hosted git repository.

paleolimbot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-nanoarrow.git



The following commit(s) were added to refs/heads/main by this push:
     new fcc540a8 docs(python): Update Python bindings readme (#474)
fcc540a8 is described below

commit fcc540a8fabe03a38f07e010f8a72c733e18a4a8
Author: Dewey Dunnington <[email protected]>
AuthorDate: Fri May 17 16:22:00 2024 -0300

    docs(python): Update Python bindings readme (#474)
    
    The previous readme was written for the previous release and is
    outdated!
---
 python/README.ipynb | 480 +++++++++++++++++++++++++++++++++++++++++-----------
 python/README.md    | 318 ++++++++++++++++++++++++++--------
 2 files changed, 624 insertions(+), 174 deletions(-)
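
Reader's note on the schema examples in the diff below: the `CSchema` repr shows `format: 'd:10,3'`, while `Schema.precision` and `Schema.scale` return `(10, 3)`. As a rough illustration of how the two relate, the following is a minimal pure-Python sketch that parses an Arrow C Data Interface decimal format string. The helper name `parse_decimal_format` is hypothetical and is not part of nanoarrow. The 128-bit default when no bit width is given is taken from the format-string convention in the Arrow C Data Interface specification.

```python
from typing import Tuple


def parse_decimal_format(fmt: str) -> Tuple[int, int, int]:
    """Parse an Arrow C Data Interface decimal format string.

    Hypothetical helper for illustration only (not a nanoarrow API).
    Accepts "d:precision,scale" or "d:precision,scale,bitwidth" and
    returns (precision, scale, bitwidth).
    """
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")

    parts = fmt[2:].split(",")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 decimal parameters: {fmt!r}")

    precision, scale = int(parts[0]), int(parts[1])
    # When the bit width is omitted, the format string denotes decimal128.
    bitwidth = int(parts[2]) if len(parts) == 3 else 128
    return precision, scale, bitwidth


print(parse_decimal_format("d:10,3"))  # (10, 3, 128)
```

In practice `na.Schema(...)` already exposes these values as attributes, so a parser like this would only be useful for understanding the raw `format` field.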

diff --git a/python/README.ipynb b/python/README.ipynb
index 0f13829a..5d62065b 100644
--- a/python/README.ipynb
+++ b/python/README.ipynb
@@ -36,11 +36,19 @@
     "\n",
     "## Installation\n",
     "\n",
-    "Python bindings for nanoarrow are not yet available on PyPI. You can install via\n",
-    "URL (requires a C compiler):\n",
+    "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
+    "[conda-forge](https://conda-forge.org/):\n",
     "\n",
-    "```bash\n",
-    "python -m pip install \"git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python\"\n",
+    "```shell\n",
+    "pip install nanoarrow\n",
+    "conda install nanoarrow -c conda-forge\n",
+    "```\n",
+    "\n",
+    "Development versions (based on the `main` branch) are also available:\n",
+    "\n",
+    "```shell\n",
+    "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
+    "    --prefer-binary --pre nanoarrow\n",
     "```\n",
     "\n",
     "If you can import the namespace, you're good to go!"
@@ -48,7 +56,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 46,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -56,102 +64,326 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Low-level C library bindings\n",
-    "\n",
-    "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.\n",
+    "## Data types, arrays, and array streams\n",
     "\n",
-    "### Schemas\n",
-    "\n",
-    "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
+    "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 47,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CSchema decimal128(10, 3)>\n",
-       "- format: 'd:10,3'\n",
-       "- name: ''\n",
-       "- flags: 2\n",
-       "- metadata: NULL\n",
-       "- dictionary: NULL\n",
-       "- children[0]:"
+       "<Schema> int32"
       ]
      },
-     "execution_count": 2,
+     "execution_count": 47,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "import pyarrow as pa\n",
-    "schema = na.c_schema(pa.decimal128(10, 3))\n",
-    "schema"
+    "na.int32()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "nanoarrow.Array<int32>[3]\n",
+       "1\n",
+       "2\n",
+       "3"
+      ]
+     },
+     "execution_count": 48,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "na.Array([1, 2, 3], na.int32())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "nanoarrow.Array<int32>[6]\n",
+       "1\n",
+       "2\n",
+       "3\n",
+       "4\n",
+       "5\n",
+       "6"
+      ]
+     },
+     "execution_count": 49,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
+    "chunked"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters."
+    "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "nanoarrow.ArrayStream<int32>"
+      ]
+     },
+     "execution_count": 50,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "stream = na.ArrayStream(chunked)\n",
+    "stream"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "nanoarrow.Array<int32>[3]\n",
+      "1\n",
+      "2\n",
+      "3\n",
+      "nanoarrow.Array<int32>[3]\n",
+      "4\n",
+      "5\n",
+      "6\n"
+     ]
+    }
+   ],
+   "source": [
+    "with stream:\n",
+    "    for chunk in stream:\n",
+    "        print(chunk)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
+      ]
+     },
+     "execution_count": 52,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
+    "na.ArrayStream.from_url(url)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "pyarrow.Field<: int32>"
+      ]
+     },
+     "execution_count": 53,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import pyarrow as pa\n",
+    "\n",
+    "pa.field(na.int32())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
+       "[\n",
+       "  [\n",
+       "    1,\n",
+       "    2,\n",
+       "    3\n",
+       "  ],\n",
+       "  [\n",
+       "    4,\n",
+       "    5,\n",
+       "    6\n",
+       "  ]\n",
+       "]"
+      ]
+     },
+     "execution_count": 54,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pa.chunked_array(chunked)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<pyarrow.lib.Int32Array object at 0x11b552500>\n",
+       "[\n",
+       "  4,\n",
+       "  5,\n",
+       "  6\n",
+       "]"
+      ]
+     },
+     "execution_count": 55,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pa.array(chunked.chunk(1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 56,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "nanoarrow.Array<int64>[3]\n",
+       "10\n",
+       "11\n",
+       "12"
+      ]
+     },
+     "execution_count": 56,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "na.Array(pa.array([10, 11, 12]))"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 57,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CSchemaView>\n",
-       "- type: 'decimal128'\n",
-       "- storage_type: 'decimal128'\n",
-       "- decimal_bitwidth: 128\n",
-       "- decimal_precision: 10\n",
-       "- decimal_scale: 3\n",
-       "- dictionary_ordered: False\n",
-       "- map_keys_sorted: False\n",
-       "- nullable: True\n",
-       "- storage_type_id: 24\n",
-       "- type_id: 24"
+       "<Schema> string"
       ]
      },
-     "execution_count": 3,
+     "execution_count": 57,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "na.c_schema_view(schema)"
+    "na.Schema(pa.string())"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function."
+    "## Low-level C library bindings\n",
+    "\n",
+    "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
+    "\n",
+    "### Schemas\n",
+    "\n",
+    "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 58,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CSchema int32>\n",
-       "- format: 'i'\n",
+       "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
+       "- format: 'd:10,3'\n",
        "- name: ''\n",
        "- flags: 2\n",
        "- metadata: NULL\n",
@@ -159,15 +391,41 @@
        "- children[0]:"
       ]
      },
-     "execution_count": 4,
+     "execution_count": 58,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "na.c_schema(pa.decimal128(10, 3))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(10, 3)"
+      ]
+     },
+     "execution_count": 59,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "schema = na.allocate_c_schema()\n",
-    "pa.int32()._export_to_c(schema._addr())\n",
-    "schema"
+    "schema = na.Schema(pa.decimal128(10, 3))\n",
+    "schema.precision, schema.scale"
    ]
   },
   {
@@ -190,29 +448,28 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 60,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CArray string>\n",
+       "<nanoarrow.c_array.CArray string>\n",
        "- length: 4\n",
        "- offset: 0\n",
        "- null_count: 1\n",
-       "- buffers: (3678035706048, 3678035705984, 3678035706112)\n",
+       "- buffers: (4754305168, 4754307808, 4754310464)\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
      },
-     "execution_count": 5,
+     "execution_count": 60,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "array = na.c_array(pa.array([\"one\", \"two\", \"three\", None]))\n",
-    "array"
+    "na.c_array([\"one\", \"two\", \"three\", None], na.string())"
    ]
   },
   {
@@ -220,67 +477,87 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:"
+    "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 61,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CArrayView>\n",
-       "- storage_type: 'string'\n",
-       "- length: 4\n",
-       "- offset: 0\n",
-       "- null_count: 1\n",
-       "- buffers[3]:\n",
-       "  - validity <bool[1 b] 11100000>\n",
-       "  - data_offset <int32[20 b] 0 3 6 11 11>\n",
-       "  - data <string[11 b] b'onetwothree'>\n",
-       "- dictionary: NULL\n",
-       "- children[0]:"
+       "['one', 'two', 'three', None]"
       ]
      },
-     "execution_count": 6,
+     "execution_count": 61,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "na.c_array_view(array)"
+    "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
+    "array.to_pylist()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
+       " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
+       " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
+      ]
+     },
+     "execution_count": 62,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "array.buffers"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions."
+    "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 63,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "3"
+       "<nanoarrow.c_array.CArray string>\n",
+       "- length: 2\n",
+       "- offset: 0\n",
+       "- null_count: 0\n",
+       "- buffers: (0, 5002908320, 4999694624)\n",
+       "- dictionary: NULL\n",
+       "- children[0]:"
       ]
      },
-     "execution_count": 7,
+     "execution_count": 63,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "array = na.allocate_c_array()\n",
-    "pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n",
-    "array.length"
+    "na.c_array_from_buffers(\n",
+    "    na.string(),\n",
+    "    2,\n",
+    "    [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
+    ")"
    ]
   },
   {
@@ -290,30 +567,29 @@
    "source": [
     "### Array streams\n",
     "\n",
-    "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`)."
+    "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 64,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CArrayStream>\n",
-       "- get_schema(): struct<some_column: int32>"
+       "<nanoarrow.c_array_stream.CArrayStream>\n",
+       "- get_schema(): struct<col1: int64>"
       ]
      },
-     "execution_count": 8,
+     "execution_count": 64,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "pa_array_child = pa.array([1, 2, 3], pa.int32())\n",
-    "pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n",
-    "reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n",
+    "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
+    "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
     "array_stream = na.c_array_stream(reader)\n",
     "array_stream"
    ]
@@ -328,25 +604,25 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 65,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "<nanoarrow.c_lib.CArray struct<some_column: int32>>\n",
+      "<nanoarrow.c_array.CArray struct<col1: int64>>\n",
       "- length: 3\n",
       "- offset: 0\n",
       "- null_count: 0\n",
       "- buffers: (0,)\n",
       "- dictionary: NULL\n",
       "- children[1]:\n",
-      "  'some_column': <nanoarrow.c_lib.CArray int32>\n",
+      "  'col1': <nanoarrow.c_array.CArray int64>\n",
       "    - length: 3\n",
       "    - offset: 0\n",
       "    - null_count: 0\n",
-      "    - buffers: (0, 3678035837056)\n",
+      "    - buffers: (0, 2642948588352)\n",
       "    - dictionary: NULL\n",
       "    - children[0]:\n"
      ]
@@ -358,34 +634,34 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:"
+    "Use `ArrayStream()` for a higher level interface:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 66,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
-       "<nanoarrow.c_lib.CArrayStream>\n",
-       "- get_schema(): struct<some_column: int32>"
+       "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
+       "{'col1': 1}\n",
+       "{'col1': 2}\n",
+       "{'col1': 3}"
       ]
      },
-     "execution_count": 10,
+     "execution_count": 66,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "array_stream = na.allocate_c_array_stream()\n",
-    "reader._export_to_c(array_stream._addr())\n",
-    "array_stream"
+    "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
+    "na.ArrayStream(reader).read_all()"
    ]
   },
   {
@@ -408,11 +684,13 @@
     "\n",
     "```shell\n",
     "# Install dependencies\n",
-    "pip install -e .[test]\n",
+    "pip install -e \".[test]\"\n",
     "\n",
     "# Run tests\n",
     "pytest -vvx\n",
-    "```"
+    "```\n",
+    "\n",
+    "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
    ]
   }
  ],
@@ -432,7 +710,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.4"
+   "version": "3.12.3"
   },
   "orig_nbformat": 4
  },
diff --git a/python/README.md b/python/README.md
index 42b4e390..f279a095 100644
--- a/python/README.md
+++ b/python/README.md
@@ -29,11 +29,19 @@ interfaces.
 
 ## Installation
 
-Python bindings for nanoarrow are not yet available on PyPI. You can install via
-URL (requires a C compiler):
+The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and
+[conda-forge](https://conda-forge.org/):
 
-```bash
-python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"
+```shell
+pip install nanoarrow
+conda install nanoarrow -c conda-forge
+```
+
+Development versions (based on the `main` branch) are also available:
+
+```shell
+pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
+    --prefer-binary --pre nanoarrow
 ```
 
 If you can import the namespace, you're good to go!
@@ -43,72 +51,207 @@ If you can import the namespace, you're good to go!
 import nanoarrow as na
 ```
 
-## Low-level C library bindings
+## Data types, arrays, and array streams
 
-The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.
+The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package.
+
+
+```python
+na.int32()
+```
+
+
+
+
+    <Schema> int32
 
-### Schemas
 
-Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
+
+
+```python
+na.Array([1, 2, 3], na.int32())
+```
+
+
+
+
+    nanoarrow.Array<int32>[3]
+    1
+    2
+    3
+
+
+
+The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this.
+
+
+```python
+chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
+chunked
+```
+
+
+
+
+    nanoarrow.Array<int32>[6]
+    1
+    2
+    3
+    4
+    5
+    6
+
+
+
+Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet.
+
+
+```python
+stream = na.ArrayStream(chunked)
+stream
+```
+
+
+
+
+    nanoarrow.ArrayStream<int32>
+
+
+
+
+```python
+with stream:
+    for chunk in stream:
+        print(chunk)
+```
+
+    nanoarrow.Array<int32>[3]
+    1
+    2
+    3
+    nanoarrow.Array<int32>[3]
+    4
+    5
+    6
+
+
+The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:
+
+
+```python
+url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
+na.ArrayStream.from_url(url)
+```
+
+
+
+
+    nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
+
+
+
+These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:
 
 
 ```python
 import pyarrow as pa
-schema = na.c_schema(pa.decimal128(10, 3))
-schema
+
+pa.field(na.int32())
 ```
 
 
 
 
-    <nanoarrow.c_lib.CSchema decimal128(10, 3)>
-    - format: 'd:10,3'
-    - name: ''
-    - flags: 2
-    - metadata: NULL
-    - dictionary: NULL
-    - children[0]:
+    pyarrow.Field<: int32>
+
+
+
+
+```python
+pa.chunked_array(chunked)
+```
+
+
+
+
+    <pyarrow.lib.ChunkedArray object at 0x12a49a250>
+    [
+      [
+        1,
+        2,
+        3
+      ],
+      [
+        4,
+        5,
+        6
+      ]
+    ]
+
+
+
+
+```python
+pa.array(chunked.chunk(1))
+```
+
+
+
+
+    <pyarrow.lib.Int32Array object at 0x11b552500>
+    [
+      4,
+      5,
+      6
+    ]
+
+
+
+
+```python
+na.Array(pa.array([10, 11, 12]))
+```
+
+
+
 
+    nanoarrow.Array<int64>[3]
+    10
+    11
+    12
 
 
-You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters.
 
 
 ```python
-na.c_schema_view(schema)
+na.Schema(pa.string())
 ```
 
 
 
 
-    <nanoarrow.c_lib.CSchemaView>
-    - type: 'decimal128'
-    - storage_type: 'decimal128'
-    - decimal_bitwidth: 128
-    - decimal_precision: 10
-    - decimal_scale: 3
-    - dictionary_ordered: False
-    - map_keys_sorted: False
-    - nullable: True
-    - storage_type_id: 24
-    - type_id: 24
+    <Schema> string
+
+
+
+## Low-level C library bindings
 
+The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.
 
+### Schemas
 
-Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function.
+Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
 
 
 ```python
-schema = na.allocate_c_schema()
-pa.int32()._export_to_c(schema._addr())
-schema
+na.c_schema(pa.decimal128(10, 3))
 ```
 
 
 
 
-    <nanoarrow.c_lib.CSchema int32>
-    - format: 'i'
+    <nanoarrow.c_schema.CSchema decimal128(10, 3)>
+    - format: 'd:10,3'
     - name: ''
     - flags: 2
     - metadata: NULL
@@ -117,6 +260,21 @@ schema
 
 
 
+Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:
+
+
+```python
+schema = na.Schema(pa.decimal128(10, 3))
+schema.precision, schema.scale
+```
+
+
+
+
+    (10, 3)
+
+
+
 The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released.
 
 ### Arrays
@@ -125,72 +283,83 @@ You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowAr
 
 
 ```python
-array = na.c_array(pa.array(["one", "two", "three", None]))
-array
+na.c_array(["one", "two", "three", None], na.string())
 ```
 
 
 
 
-    <nanoarrow.c_lib.CArray string>
+    <nanoarrow.c_array.CArray string>
     - length: 4
     - offset: 0
     - null_count: 1
-    - buffers: (3678035706048, 3678035705984, 3678035706112)
+    - buffers: (4754305168, 4754307808, 4754310464)
     - dictionary: NULL
     - children[0]:
 
 
 
-You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:
+Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:
 
 
 ```python
-na.c_array_view(array)
+array = na.Array(["one", "two", "three", None], na.string())
+array.to_pylist()
 ```
 
 
 
 
-    <nanoarrow.c_lib.CArrayView>
-    - storage_type: 'string'
-    - length: 4
-    - offset: 0
-    - null_count: 1
-    - buffers[3]:
-      - validity <bool[1 b] 11100000>
-      - data_offset <int32[20 b] 0 3 6 11 11>
-      - data <string[11 b] b'onetwothree'>
-    - dictionary: NULL
-    - children[0]:
+    ['one', 'two', 'three', None]
+
+
+
+
+```python
+array.buffers
+```
+
+
 
 
+    (nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
+     nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
+     nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))
 
-Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions.
+
+
+Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:
 
 
 ```python
-array = na.allocate_c_array()
-pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
-array.length
+na.c_array_from_buffers(
+    na.string(),
+    2,
+    [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
+)
 ```
 
 
 
 
-    3
+    <nanoarrow.c_array.CArray string>
+    - length: 2
+    - offset: 0
+    - null_count: 0
+    - buffers: (0, 5002908320, 4999694624)
+    - dictionary: NULL
+    - children[0]:
 
 
 
 ### Array streams
 
-You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`).
+You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`).
 
 
 ```python
-pa_array_child = pa.array([1, 2, 3], pa.int32())
-pa_array = pa.record_batch([pa_array_child], names=["some_column"])
-reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
+pa_batch = pa.record_batch({"col1": [1, 2, 3]})
+reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
 array_stream = na.c_array_stream(reader)
 array_stream
 ```
@@ -198,8 +367,8 @@ array_stream
 
 
 
-    <nanoarrow.c_lib.CArrayStream>
-    - get_schema(): struct<some_column: int32>
+    <nanoarrow.c_array_stream.CArrayStream>
+    - get_schema(): struct<col1: int64>
 
 
 
@@ -211,36 +380,37 @@ for array in array_stream:
     print(array)
 ```
 
-    <nanoarrow.c_lib.CArray struct<some_column: int32>>
+    <nanoarrow.c_array.CArray struct<col1: int64>>
     - length: 3
     - offset: 0
     - null_count: 0
     - buffers: (0,)
     - dictionary: NULL
     - children[1]:
-      'some_column': <nanoarrow.c_lib.CArray int32>
+      'col1': <nanoarrow.c_array.CArray int64>
         - length: 3
         - offset: 0
         - null_count: 0
-        - buffers: (0, 3678035837056)
+        - buffers: (0, 2642948588352)
         - dictionary: NULL
         - children[0]:
 
 
-You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:
+Use `ArrayStream()` for a higher level interface:
 
 
 ```python
-array_stream = na.allocate_c_array_stream()
-reader._export_to_c(array_stream._addr())
-array_stream
+reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
+na.ArrayStream(reader).read_all()
 ```
 
 
 
 
-    <nanoarrow.c_lib.CArrayStream>
-    - get_schema(): struct<some_column: int32>
+    nanoarrow.Array<non-nullable struct<col1: int64>>[3]
+    {'col1': 1}
+    {'col1': 2}
+    {'col1': 3}
 
 
 
@@ -264,3 +434,5 @@ pip install -e ".[test]"
 # Run tests
 pytest -vvx
 ```
+
+CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.
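
Reader's note on the `array.buffers` output in the diff above: for `["one", "two", "three", None]` the README shows a validity bitmap `11100000`, int32 offsets `0 3 6 11 11`, and data bytes `b'onetwothree'`. The following is a minimal pure-Python sketch of how those three buffers could be derived for an Arrow string array. The function name `string_array_buffers` is hypothetical and is not a nanoarrow API, little-endian offsets are an assumption, and real implementations also pad and align buffers, which this sketch skips.

```python
import struct
from typing import List, Optional, Tuple


def string_array_buffers(values: List[Optional[str]]) -> Tuple[bytes, bytes, bytes]:
    """Build (validity, offsets, data) buffers for an Arrow string array.

    Hypothetical helper for illustration only (not a nanoarrow API).
    """
    validity = bytearray((len(values) + 7) // 8)  # one bit per element
    offsets = [0]  # n + 1 offsets for n elements
    data = bytearray()

    for i, value in enumerate(values):
        if value is not None:
            # Arrow bitmaps are least-significant-bit first within each byte.
            validity[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # A null element repeats the previous offset (zero-length slot).
        offsets.append(len(data))

    # Assumption: 32-bit little-endian offsets, as on most platforms.
    offsets_bytes = struct.pack(f"<{len(offsets)}i", *offsets)
    return bytes(validity), offsets_bytes, bytes(data)


validity, offsets, data = string_array_buffers(["one", "two", "three", None])

# Render the bitmap in element order, as the buffer repr in the README does.
bits = "".join(str((byte >> bit) & 1) for byte in validity for bit in range(8))
print(bits)                           # 11100000
print(struct.unpack("<5i", offsets))  # (0, 3, 6, 11, 11)
print(data)                           # b'onetwothree'
```

The same layout explains the `c_array_from_buffers()` example: offsets `[0, 3, 6]` over the data `b"abcdef"` describe two strings, `"abc"` and `"def"`, and passing `None` for the validity buffer means no element is null.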
