Re: [PR] GH-40866: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add support for row-major [arrow]

via GitHub Mon, 08 Apr 2024 06:41:34 -0700


jorisvandenbossche commented on code in PR #40867:
URL: https://github.com/apache/arrow/pull/40867#discussion_r1555721440



##########
cpp/src/arrow/record_batch.cc:
##########
@@ -283,18 +283,55 @@ struct ConvertColumnsToTensorVisitor {
   }
 };
 
+template <typename Out>
+struct ConvertColumnsToTensorRowMajorVisitor {
+  Out*& out_values;
+  const ArrayData& in_data;
+  int num_cols;
+  int col_idx;
+
+  template <typename T>
+  Status Visit(const T&) {
+    if constexpr (is_numeric(T::type_id)) {
+      using In = typename T::c_type;
+      auto in_values = ArraySpan(in_data).GetSpan<In>(1, in_data.length);
+
+      if (in_data.null_count == 0) {
+        for (int64_t j = 0; j < in_data.length; ++j) {

Review Comment:
   You can just use `i` given there is other loop variable 



##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
                 
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
         return pyarrow_wrap_array(c_array)
 
-    def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+    def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True, 
MemoryPool memory_pool=None):
         """
         Convert to a :class:`~pyarrow.Tensor`.
 
         RecordBatches that can be converted have fields of type signed or 
unsigned
-        integer or float, including all bit-widths. RecordBatches with 
validity bitmask
-        for any of the arrays can be converted with ``null_to_nan``turned to 
``True``.
-        In this case null values are converted to NaN and signed or unsigned 
integer
-        type arrays are promoted to appropriate float type.
+        integer or float, including all bit-widths.
+
+        ``null_to_nan`` is ``False`` by default and will raise an error in case
+        validity bitmask exists. RecordBatches with validity bitmask for any 
of the

Review Comment:
   ```suggestion
           ``null_to_nan`` is ``False`` by default and this method will raise 
an error in case
           any nulls are present. RecordBatches with nulls
   ```
   
   It doesn't matter if the array has a bitmask allocated (they can be all 
valid), just whether there are nulls or not in the actual data.



##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
                 
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
         return pyarrow_wrap_array(c_array)
 
-    def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+    def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True, 
MemoryPool memory_pool=None):
         """
         Convert to a :class:`~pyarrow.Tensor`.
 
         RecordBatches that can be converted have fields of type signed or 
unsigned
-        integer or float, including all bit-widths. RecordBatches with 
validity bitmask
-        for any of the arrays can be converted with ``null_to_nan``turned to 
``True``.
-        In this case null values are converted to NaN and signed or unsigned 
integer
-        type arrays are promoted to appropriate float type.
+        integer or float, including all bit-widths.
+
+        ``null_to_nan`` is ``False`` by default and will raise an error in case
+        validity bitmask exists. RecordBatches with validity bitmask for any 
of the
+        arrays can be converted with ``null_to_nan`` turned to ``True``. In 
this case
+        null values are converted to ``NaN`` and signed or unsigned integer 
type arrays
+        are promoted to appropriate float type.

Review Comment:
   ```suggestion
           null values are converted to ``NaN`` and integer type arrays
           are promoted to the appropriate float type.
   ```



##########
cpp/src/arrow/record_batch_test.cc:
##########
@@ -756,7 +763,8 @@ TEST_F(TestRecordBatch, ToTensorSupportedNullToNan) {
 
   auto batch = RecordBatch::Make(schema, length, {a0, a1});
 
-  ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
+  ASSERT_OK_AND_ASSIGN(auto tensor,
+                       batch->ToTensor(/*null_to_nan=*/true, 
/*row_major=*/false));

Review Comment:
   We should probably also test the `null_to_nan` option for row major as well?
   
   Would it be possible to test both inside this function? Only when creating 
the expected tensor, the values need to be passed in a different order?



##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
                 
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
         return pyarrow_wrap_array(c_array)
 
-    def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+    def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True, 
MemoryPool memory_pool=None):
         """
         Convert to a :class:`~pyarrow.Tensor`.
 
         RecordBatches that can be converted have fields of type signed or 
unsigned
-        integer or float, including all bit-widths. RecordBatches with 
validity bitmask
-        for any of the arrays can be converted with ``null_to_nan``turned to 
``True``.
-        In this case null values are converted to NaN and signed or unsigned 
integer
-        type arrays are promoted to appropriate float type.
+        integer or float, including all bit-widths.
+
+        ``null_to_nan`` is ``False`` by default and will raise an error in case
+        validity bitmask exists. RecordBatches with validity bitmask for any 
of the
+        arrays can be converted with ``null_to_nan`` turned to ``True``. In 
this case

Review Comment:
   ```suggestion
           arrays can be converted with ``null_to_nan`` set to ``True``. In 
this case
   ```



##########
python/pyarrow/tests/test_table.py:
##########
@@ -926,36 +926,46 @@ def test_recordbatch_to_tensor_uniform_type(typ):
             pa.array(arr3, type=pa.from_numpy_dtype(typ)),
         ], ["a", "b", "c"]
     )
-    result = batch.to_tensor()
 
+    result = batch.to_tensor(row_major=False)
     x = np.array([arr1, arr2, arr3], typ).transpose()
     expected = pa.Tensor.from_numpy(x)
+    check_tensors(result, expected, pa.from_numpy_dtype(typ), 27)
 
+    result = batch.to_tensor()
+    x = np.array([arr1, arr2, arr3], typ, order='F').transpose()
+    expected = pa.Tensor.from_numpy(x)

Review Comment:
   ```suggestion
       result = batch.to_tensor(row_major=False)
       x = np.column_stack([arr1, arr2, arr3]).astype(typ, order=F")
       expected = pa.Tensor.from_numpy(x)
       check_tensors(result, expected, pa.from_numpy_dtype(typ), 27)
   
       result = batch.to_tensor()
       x = np.column_stack([arr1, arr2, arr3]).astype(typ, order=C")
       expected = pa.Tensor.from_numpy(x)
   ```
   
   This might be a clearer way to create the expected numpy arrays? Then you 
don't need to think about how the `np.array(..)` constructor will create the 
array (stack as rows), and then we don't have the `order='F').transpose()` to 
create row major data.
   
   (and same for some occurrences below)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-40866: [C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add support for row-major [arrow]

Reply via email to