jorisvandenbossche commented on code in PR #41870:
URL: https://github.com/apache/arrow/pull/41870#discussion_r1629112207
##########
cpp/src/arrow/tensor.cc:
##########
@@ -269,13 +270,15 @@ struct ConvertColumnsToTensorRowMajorVisitor {
     auto in_values = ArraySpan(in_data).GetSpan<In>(1, in_data.length);
     if (in_data.null_count == 0) {
-      for (int64_t i = 0; i < in_data.length; ++i) {
-        out_values[i * num_cols + col_idx] = static_cast<Out>(in_values[i]);
+      for (int64_t data_idx = 0; data_idx < in_data.length; ++data_idx) {
Review Comment:
It might be fine to keep `i` here as the loop variable: it is a short loop, it
is clear where the value comes from (i.e. iterating over the length of
`in_data`), and it keeps the line a bit shorter.
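As a sketch, that would read something like the following (the loop body here is copied from the removed lines above; the body in the final patch may differ):

```cpp
// Sketch only: keep the short loop variable, since the bound already makes it
// clear that `i` iterates over the length of `in_data`.
for (int64_t i = 0; i < in_data.length; ++i) {
  out_values[i * num_cols + col_idx] = static_cast<Out>(in_values[i]);
}
```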
##########
cpp/src/arrow/tensor.cc:
##########
@@ -285,74 +288,78 @@ struct ConvertColumnsToTensorRowMajorVisitor {
 };
 template <typename DataType>
-inline void ConvertColumnsToTensor(const RecordBatch& batch, uint8_t* out,
-                                   bool row_major) {
+inline void ConvertColumnsToTensor(const Table& table, uint8_t* out, bool row_major) {
   using CType = typename arrow::TypeTraits<DataType>::CType;
   auto* out_values = reinterpret_cast<CType*>(out);
   int i = 0;
-  for (const auto& column : batch.columns()) {
-    if (row_major) {
-      ConvertColumnsToTensorRowMajorVisitor<CType> visitor{out_values, *column->data(),
-                                                           batch.num_columns(), i++};
-      DCHECK_OK(VisitTypeInline(*column->type(), &visitor));
-    } else {
-      ConvertColumnsToTensorVisitor<CType> visitor{out_values, *column->data()};
-      DCHECK_OK(VisitTypeInline(*column->type(), &visitor));
+  for (const auto& column : table.columns()) {
+    int j = 0;
+    for (const auto& chunk : column->chunks()) {
+      if (row_major) {
+        ConvertArrayToTensorRowMajorVisitor<CType> visitor{out_values, *chunk->data(),
+                                                           table.num_columns(), i, j};
Review Comment:
While we are here, we could maybe use the more descriptive `col_idx` and
`chunk_idx` (the names that are also used in
`ConvertArrayToTensorRowMajorVisitor`).
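Concretely, something like this (a sketch: the visitor construction is taken from the hunk above, while the placement of the index increments and the rest of the loop body are assumed here):

```cpp
int col_idx = 0;
for (const auto& column : table.columns()) {
  int chunk_idx = 0;
  for (const auto& chunk : column->chunks()) {
    if (row_major) {
      // The index names now match the members of ConvertArrayToTensorRowMajorVisitor.
      ConvertArrayToTensorRowMajorVisitor<CType> visitor{
          out_values, *chunk->data(), table.num_columns(), col_idx, chunk_idx};
      DCHECK_OK(VisitTypeInline(*chunk->type(), &visitor));
    }
    // ... column-major branch as in the PR ...
    ++chunk_idx;
  }
  ++col_idx;
}
```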
##########
python/pyarrow/table.pxi:
##########
@@ -4873,6 +4873,85 @@ cdef class Table(_Tabular):
         return result
+    def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True,
+                  MemoryPool memory_pool=None):
+        """
+        Convert to a :class:`~pyarrow.Tensor`.
+
+        Tables that can be converted have fields of type signed or unsigned integer or float,
+        including all bit-widths.
+
+        ``null_to_nan`` is ``False`` by default and this method will raise an error in case
+        any nulls are present. Tables with nulls can be converted with ``null_to_nan`` set to
+        ``True``. In this case null values are converted to ``NaN`` and integer type arrays are
+        promoted to the appropriate float type.
+
+        Parameters
+        ----------
+        null_to_nan : bool, default False
+            Whether to write null values in the result as ``NaN``.
+        row_major : bool, default True
+            Whether resulting Tensor is row-major or column-major
+        memory_pool : MemoryPool, default None
+            For memory allocations, if required, otherwise use default pool
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> table = pa.table(
+        ...     [
+        ...         pa.chunked_array([[1, 2], [3, 4, None]], type=pa.int32()),
+        ...         pa.chunked_array([[10, 20, 30], [40, None]], type=pa.float32()),
+        ...     ], names = ["a", "b"]
+        ... )
+
+        >>> table
+        pyarrow.Table
+        a: int32
+        b: float
+        ----
+        a: [[1,2],[3,4,null]]
+        b: [[10,20,30],[40,null]]
+
+        Convert a Table to row-major Tensor with null values written as ``NaN``s:
+
+        >>> table.to_tensor(null_to_nan=True)
+        <pyarrow.Tensor>
+        type: double
+        shape: (5, 2)
+        strides: (16, 8)
+        >>> table.to_tensor(null_to_nan=True).to_numpy()
+        array([[ 1., 10.],
+               [ 2., 20.],
+               [ 3., 30.],
+               [ 4., 40.],
+               [nan, nan]])
+
+        Convert a Table to column-major Tensor
+
+        >>> table.to_tensor(null_to_nan=True, row_major=False)
+        <pyarrow.Tensor>
+        type: double
+        shape: (5, 2)
+        strides: (8, 40)
+        >>> table.to_tensor(null_to_nan=True, row_major=False).to_numpy()
+        array([[ 1., 10.],
+               [ 2., 20.],
+               [ 3., 30.],
+               [ 4., 40.],
+               [nan, nan]])
+        """
+        cdef:
+            shared_ptr[CTable] c_table
+            shared_ptr[CTensor] c_tensor
+            CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
+
+        c_table = pyarrow_unwrap_table(self)
Review Comment:
I know you copied this from RecordBatch, but seeing this now: in the class
itself, you can actually use `self.table` or `self.sp_table` instead of calling
`pyarrow_unwrap_table` to get the CTable.
Then you also don't need to declare `c_table`, but I think it should work to
directly do something like `GetResultValue(self.table.ToTensor(...))`
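For illustration, the method body could then look roughly like this (a sketch only; it assumes `CTable.ToTensor` is exposed in `libarrow.pxd` with the same `(null_to_nan, row_major, pool)` arguments the current code passes, and that the `nogil`/`pyarrow_wrap_tensor` pattern from `RecordBatch.to_tensor` carries over):

```cython
cdef:
    shared_ptr[CTensor] c_tensor
    CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)

with nogil:
    # self.table already points at the CTable, so no unwrapping is needed
    # and the c_table declaration goes away.
    c_tensor = GetResultValue(
        self.table.ToTensor(null_to_nan, row_major, pool))
return pyarrow_wrap_tensor(c_tensor)
```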
##########
python/pyarrow/tests/test_table.py:
##########
@@ -1219,6 +1219,295 @@ def test_recordbatch_to_tensor_unsupported():
         batch.to_tensor()
+@pytest.mark.parametrize('typ', [
Review Comment:
This adds a lot of code to the test file while in practice not really testing
anything new. Could we either parametrize some of the existing RecordBatch
tests to run with both RecordBatch and Table (some other tests in this file are
already parametrized that way, see e.g. `test_table_basics`), or only test the
aspects unique to `Table.to_tensor` (that the method exists and works, and a
case with multiple chunks) and rely on the RecordBatch tests for the rest (type
promotion, null handling, etc.)?
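For the parametrized option, a sketch of what a shared test could look like (the test name is made up here, and the expected values simply mirror the docstring example above):

```python
import numpy as np
import pyarrow as pa
import pytest


@pytest.mark.parametrize('constructor', [pa.table, pa.record_batch])
def test_to_tensor_null_to_nan(constructor):
    # Plain (non-chunked) arrays so the same data works for both
    # RecordBatch and Table.
    data = constructor(
        [
            pa.array([1, 2, 3, 4, None], type=pa.int32()),
            pa.array([10, 20, 30, 40, None], type=pa.float32()),
        ],
        names=['a', 'b'],
    )
    result = data.to_tensor(null_to_nan=True).to_numpy()
    expected = np.array(
        [[1, 10], [2, 20], [3, 30], [4, 40], [np.nan, np.nan]],
        dtype=np.float64)
    # assert_array_equal treats NaNs in matching positions as equal.
    np.testing.assert_array_equal(result, expected)
```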
##########
cpp/src/arrow/table.h:
##########
@@ -101,6 +101,19 @@ class ARROW_EXPORT Table {
   static Result<std::shared_ptr<Table>> FromChunkedStructArray(
       const std::shared_ptr<ChunkedArray>& array);
+  /// \brief Convert table with one data type to Tensor
Review Comment:
This "with one data type" is probably a leftover from the first version (when
only a single data type was supported)? (Same for the RecordBatch doc.)
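For example, the brief could read something like this (wording is only a suggestion, borrowing from the Python docstring above):

```cpp
/// \brief Convert the table to a Tensor
///
/// Supported are signed and unsigned integer and floating point types.
```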