(arrow) branch main updated: GH-40841: [Docs][C++][Python] Add initial documentation for RecordBatch::Tensor conversion (#40842)

jorisvandenbossche Fri, 29 Mar 2024 00:29:41 -0700

This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/main by this push:
     new ed8c3630db GH-40841: [Docs][C++][Python] Add initial documentation for RecordBatch::Tensor conversion (#40842)
ed8c3630db is described below

commit ed8c3630dbe2261bed9123a4ccfc7df0e3f031bd
Author: Alenka Frim <[email protected]>
AuthorDate: Fri Mar 29 08:29:28 2024 +0100

    GH-40841: [Docs][C++][Python] Add initial documentation for RecordBatch::Tensor conversion (#40842)
    
    ### Rationale for this change
    
    The work on the conversion from `Table`/`RecordBatch` to `Tensor` is progressing and we have to make sure to add information to the documentation.
    
    ### What changes are included in this PR?
    
    I propose to add
    
    - a new page (`converting_recordbatch_to_tensor.rst`) in the `cpp/examples` section,
    - a new section (Conversion of RecordBatch to Tensor) in `docs/source/python/data.rst`.
    
    The content above will be updated as features are added in the future (row-major conversion, `Table::ToTensor`, DLPack support for the `Tensor` class, etc.).
    
    ### Are these changes tested?
    
    It will be tested with the crossbow preview-docs job.
    
    ### Are there any user-facing changes?
    
    No, just documentation.
    * GitHub Issue: #40841
    
    Lead-authored-by: AlenkaF <[email protected]>
    Co-authored-by: Alenka Frim <[email protected]>
    Co-authored-by: Joris Van den Bossche <[email protected]>
    Signed-off-by: Joris Van den Bossche <[email protected]>
---
 .../examples/converting_recordbatch_to_tensor.rst  | 46 +++++++++++++++++++
 docs/source/cpp/examples/index.rst                 |  1 +
 docs/source/python/data.rst                        | 52 ++++++++++++++++++++++
 3 files changed, 99 insertions(+)

diff --git a/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst b/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst
new file mode 100644
index 0000000000..2be27096cf
--- /dev/null
+++ b/docs/source/cpp/examples/converting_recordbatch_to_tensor.rst
@@ -0,0 +1,46 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+Conversion of ``RecordBatch`` to ``Tensor`` instances
+=====================================================
+
+Arrow provides a method to convert ``RecordBatch`` objects to a ``Tensor``
+with two dimensions:
+
+.. code::
+
+   std::shared_ptr<RecordBatch> batch;
+
+   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor());
+   ASSERT_OK(tensor->Validate());
+
+The conversion supports signed and unsigned integer types plus float types.
+If the ``RecordBatch`` contains null values, the conversion succeeds only if
+the ``null_to_nan`` parameter is set to ``true``. In this case all
+types will be promoted to a floating-point data type.
+
+.. code::
+
+   std::shared_ptr<RecordBatch> batch;
+
+   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
+   ASSERT_OK(tensor->Validate());
+
+Currently only column-major conversion is supported.
diff --git a/docs/source/cpp/examples/index.rst b/docs/source/cpp/examples/index.rst
index b886a0d29e..90b00bbdf6 100644
--- a/docs/source/cpp/examples/index.rst
+++ b/docs/source/cpp/examples/index.rst
@@ -27,3 +27,4 @@ Examples
    dataset_skyhook_scan_example
    row_columnar_conversion
    std::tuple-like ranges to Arrow <tuple_range_conversion>
+   Converting RecordBatch to Tensor <converting_recordbatch_to_tensor>
diff --git a/docs/source/python/data.rst b/docs/source/python/data.rst
index 2cc33561d4..9156157fcd 100644
--- a/docs/source/python/data.rst
+++ b/docs/source/python/data.rst
@@ -560,3 +560,55 @@ schema without having to get any of the batches.::
    x: int64
 
 It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.
+
+Conversion of RecordBatch to Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has its own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two-dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a tensor.
+
+The data types supported in this conversion are signed and unsigned integer
+and float types. Currently only column-major conversion is supported.
+
+   >>> import pyarrow as pa
+   >>> arr1 = [1, 2, 3, 4, 5]
+   >>> arr2 = [10, 20, 30, 40, 50]
+   >>> batch = pa.RecordBatch.from_arrays(
+   ...     [
+   ...         pa.array(arr1, type=pa.uint16()),
+   ...         pa.array(arr2, type=pa.int16()),
+   ...     ], ["a", "b"]
+   ... )
+   >>> batch.to_tensor()
+   <pyarrow.Tensor>
+   type: int32
+   shape: (5, 2)
+   strides: (4, 20)
+   >>> batch.to_tensor().to_numpy()
+   array([[ 1, 10],
+          [ 2, 20],
+          [ 3, 30],
+          [ 4, 40],
+          [ 5, 50]], dtype=int32)
+
+With ``null_to_nan`` set to ``True`` one can also convert data with
+nulls. They will be converted to ``NaN``:
+
+   >>> import pyarrow as pa
+   >>> batch = pa.record_batch(
+   ...     [
+   ...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
+   ...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
+   ...     ], names=["a", "b"]
+   ... )
+   >>> batch.to_tensor(null_to_nan=True).to_numpy()
+   array([[ 1., 10.],
+          [ 2., 20.],
+          [ 3., 30.],
+          [ 4., 40.],
+          [nan, nan]])
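
As a rough check on the column-major layout described in the Python section, the byte strides of a two-dimensional tensor can be worked out by hand. The helper below, `column_major_strides`, is a hypothetical name introduced only for this sketch and is not part of pyarrow; it assumes a dense buffer with no padding between elements.

```python
def column_major_strides(shape, itemsize):
    """Byte strides of a dense column-major (Fortran-order) array.

    Hypothetical helper for illustration only; not a pyarrow function.
    Assumes elements are packed with no padding.
    """
    strides = []
    step = itemsize
    for extent in shape:
        # The first axis moves by one element; each later axis must
        # skip over everything spanned by the axes before it.
        strides.append(step)
        step *= extent
    return tuple(strides)


# Five rows and two columns of int32 values (4 bytes each).
strides = column_major_strides((5, 2), 4)
print(strides)
```

Moving down a column advances by one 4-byte element, while moving to the next column skips all five rows (20 bytes), which is why each column of the batch stays contiguous in the resulting tensor.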
