This is an automated email from the ASF dual-hosted git repository.
jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new ed8c3630db GH-40841: [Docs][C++][Python] Add initial documentation for
RecordBatch::Tensor conversion (#40842)
ed8c3630db is described below
commit ed8c3630dbe2261bed9123a4ccfc7df0e3f031bd
Author: Alenka Frim <[email protected]>
AuthorDate: Fri Mar 29 08:29:28 2024 +0100
GH-40841: [Docs][C++][Python] Add initial documentation for
RecordBatch::Tensor conversion (#40842)
### Rationale for this change
The work on the conversion from `Table`/`RecordBatch` to `Tensor` is
progressing and we have to make sure to add information to the documentation.
### What changes are included in this PR?
I propose to add
- a new page (`converting_recordbatch_to_tensor.rst`) in the `cpp/examples`
section,
- a new section (Conversion of RecordBatch to Tensor) in
`docs/source/python/data.rst`.
The content above will be updated as features are added in the future
(row-major conversion, `Table::ToTensor`, DLPack support for the `Tensor` class,
etc.).
### Are these changes tested?
It will be tested with the crossbow preview-docs job.
### Are there any user-facing changes?
No, just documentation.
* GitHub Issue: #40841
Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
---
.../examples/converting_recordbatch_to_tensor.rst | 46 +++++++++++++++++++
docs/source/cpp/examples/index.rst | 1 +
docs/source/python/data.rst | 52 ++++++++++++++++++++++
3 files changed, 99 insertions(+)
diff --git a/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst b/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst
new file mode 100644
index 0000000000..2be27096cf
--- /dev/null
+++ b/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst
@@ -0,0 +1,46 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+Conversion of ``RecordBatch`` to ``Tensor`` instances
+=====================================================
+
+Arrow provides a method to convert ``RecordBatch`` objects to a ``Tensor``
+with two dimensions:
+
+.. code::
+
+ std::shared_ptr<RecordBatch> batch;
+
+ ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor());
+ ASSERT_OK(tensor->Validate());
+
+The conversion supports signed and unsigned integer types as well as float types.
+If the ``RecordBatch`` contains null values, the conversion succeeds only if the
+``null_to_nan`` parameter is set to ``true``. In that case all
+types are promoted to a floating-point data type.
+
+.. code::
+
+ std::shared_ptr<RecordBatch> batch;
+
+ ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
+ ASSERT_OK(tensor->Validate());
+
+Currently only column-major conversion is supported.
diff --git a/docs/source/cpp/examples/index.rst b/docs/source/cpp/examples/index.rst
index b886a0d29e..90b00bbdf6 100644
--- a/docs/source/cpp/examples/index.rst
+++ b/docs/source/cpp/examples/index.rst
@@ -27,3 +27,4 @@ Examples
dataset_skyhook_scan_example
row_columnar_conversion
std::tuple-like ranges to Arrow <tuple_range_conversion>
+ Converting RecordBatch to Tensor <converting_recordbatch_to_tensor>
diff --git a/docs/source/python/data.rst b/docs/source/python/data.rst
index 2cc33561d4..9156157fcd 100644
--- a/docs/source/python/data.rst
+++ b/docs/source/python/data.rst
@@ -560,3 +560,55 @@ schema without having to get any of the batches.::
x: int64
It can also be sent between languages using the :ref:`C stream interface
<c-stream-interface>`.
+
+Conversion of RecordBatch to Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has its own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two-dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a tensor.
+
+Data types supported in this conversion are unsigned and signed integer types and
+float types. Currently only column-major conversion is supported.
+
+ >>> import pyarrow as pa
+ >>> arr1 = [1, 2, 3, 4, 5]
+ >>> arr2 = [10, 20, 30, 40, 50]
+ >>> batch = pa.RecordBatch.from_arrays(
+ ... [
+ ... pa.array(arr1, type=pa.uint16()),
+ ... pa.array(arr2, type=pa.int16()),
+ ... ], ["a", "b"]
+ ... )
+ >>> batch.to_tensor()
+ <pyarrow.Tensor>
+ type: int32
+ shape: (5, 2)
+ strides: (4, 20)
+ >>> batch.to_tensor().to_numpy()
+ array([[ 1, 10],
+ [ 2, 20],
+ [ 3, 30],
+ [ 4, 40],
+ [ 5, 50]], dtype=int32)
+
+With ``null_to_nan`` set to ``True`` one can also convert data with
+nulls. They will be converted to ``NaN``:
+
+ >>> import pyarrow as pa
+ >>> batch = pa.record_batch(
+ ... [
+ ... pa.array([1, 2, 3, 4, None], type=pa.int32()),
+ ... pa.array([10, 20, 30, 40, None], type=pa.float32()),
+ ... ], names = ["a", "b"]
+ ... )
+ >>> batch.to_tensor(null_to_nan=True).to_numpy()
+ array([[ 1., 10.],
+ [ 2., 20.],
+ [ 3., 30.],
+ [ 4., 40.],
+ [nan, nan]])
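
The strides reported by ``to_tensor()`` follow from the column-major layout: moving down one row skips a single element, while moving to the next column skips a whole column of elements. The sketch below works through that arithmetic in plain Python for a tensor with 5 rows and 2 columns of 4-byte values, such as the int32 data in the first Python example. Note that ``column_major_strides`` is a hypothetical helper written for illustration only; it is not part of pyarrow.

```python
def column_major_strides(shape, itemsize):
    """Return byte strides for a column-major (Fortran-order) array.

    Hypothetical helper for illustration; not a pyarrow function.
    """
    strides = []
    step = itemsize  # the first dimension varies fastest in column-major order
    for dim in shape:
        strides.append(step)
        step *= dim  # the next dimension skips a full run of the previous one
    return tuple(strides)


# 5 rows, 2 columns, 4 bytes per int32 value
strides = column_major_strides((5, 2), 4)
print(strides)
```

For a shape of ``(5, 2)`` and an item size of 4 bytes this gives a row stride of 4 bytes and a column stride of 5 * 4 = 20 bytes.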