jorisvandenbossche commented on code in PR #40842:
URL: https://github.com/apache/arrow/pull/40842#discussion_r1543064894

##########
docs/source/cpp/examples/converting_recordbatch_to_tensor.rst:
##########
@@ -0,0 +1,46 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+Conversion of ``RecordBatch`` to ``Tensor`` instances
+=====================================================
+
+Arrow provides a method to convert ``RecordBatch`` objects to ``Tensors``
+with two dimensions:
+
+.. code::
+
+   std::shared_ptr<RecordBatch> batch;
+
+   ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor());
+   ASSERT_OK(tensor->Validate());
+
+The conversion supports signed and unsigned integer types plus float types,
+all widths included. In case the ``RecordBatch`` has null values the conversion
+succeeds if ``null_to_nan`` parameter is set to ``true``. In this case all
+types will be promoted to float-point data type.

Review Comment:
```suggestion
types will be promoted to a floating-point data type.
```

##########
docs/source/python/data.rst:
##########
@@ -560,3 +560,59 @@ schema without having to get any of the batches.::
     x: int64

 It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.
+
+Conversion of RecordBatch do Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has it's own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a matrix.
+
+Data types supported in this conversion are unsigned, signed integer and float
+types of all widths. Currently only column-major conversion is supported.
+
+   >>> import pyarrow as pa
+   >>> arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
+   >>> arr2 = [10, 20, 30, 40, 50, 60, 70, 80, 90]

Review Comment:
```suggestion
   >>> arr1 = [1, 2, 3, 4, 5]
   >>> arr2 = [10, 20, 30, 40, 50]
```
Just to keep the vertical screen real estate taken by the output a bit shorter; I think it is just as illustrative as the longer version.

##########
docs/source/python/data.rst:
##########
@@ -560,3 +560,59 @@ schema without having to get any of the batches.::
     x: int64

 It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.
+
+Conversion of RecordBatch do Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has it's own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a matrix.

Review Comment:
```suggestion
to efficiently convert tabular columnar data into a tensor.
```
(It's mentioned above that this is also called a matrix, but after that I would try to use the same word as much as possible, for consistency.)

##########
docs/source/python/data.rst:
##########
@@ -560,3 +560,59 @@ schema without having to get any of the batches.::
     x: int64

 It can also be sent between languages using the :ref:`C stream interface <c-stream-interface>`.
+
+Conversion of RecordBatch do Tensor
+-----------------------------------
+
+Each array of the ``RecordBatch`` has it's own contiguous memory that is not necessarily
+adjacent to other arrays. A different memory structure that is used in machine learning
+libraries is a two dimensional array (also called a 2-dim tensor or a matrix) which takes
+only one contiguous block of memory.
+
+For this reason there is a function ``pyarrow.RecordBatch.to_tensor()`` available
+to efficiently convert tabular columnar data into a matrix.
+
+Data types supported in this conversion are unsigned, signed integer and float
+types of all widths. Currently only column-major conversion is supported.

Review Comment:
I think it is sufficient to say that it supports integer and float data types. Written generally like that, it is clear that this covers both signed and unsigned types of all widths.

##########
docs/source/cpp/examples/converting_recordbatch_to_tensor.rst:
##########
@@ -0,0 +1,46 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+
+Conversion of ``RecordBatch`` to ``Tensor`` instances
+=====================================================
+
+Arrow provides a method to convert ``RecordBatch`` objects to ``Tensors``

Review Comment:
```suggestion
Arrow provides a method to convert ``RecordBatch`` objects to a ``Tensor``
```
