jorisvandenbossche commented on code in PR #40867:
URL: https://github.com/apache/arrow/pull/40867#discussion_r1555721440
##########
cpp/src/arrow/record_batch.cc:
##########
@@ -283,18 +283,55 @@ struct ConvertColumnsToTensorVisitor {
}
};
+template <typename Out>
+struct ConvertColumnsToTensorRowMajorVisitor {
+ Out*& out_values;
+ const ArrayData& in_data;
+ int num_cols;
+ int col_idx;
+
+ template <typename T>
+ Status Visit(const T&) {
+ if constexpr (is_numeric(T::type_id)) {
+ using In = typename T::c_type;
+ auto in_values = ArraySpan(in_data).GetSpan<In>(1, in_data.length);
+
+ if (in_data.null_count == 0) {
+ for (int64_t j = 0; j < in_data.length; ++j) {
Review Comment:
You can just use `i` given there is other loop variable
##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
return pyarrow_wrap_array(c_array)
- def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+ def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True,
MemoryPool memory_pool=None):
"""
Convert to a :class:`~pyarrow.Tensor`.
RecordBatches that can be converted have fields of type signed or
unsigned
- integer or float, including all bit-widths. RecordBatches with
validity bitmask
- for any of the arrays can be converted with ``null_to_nan``turned to
``True``.
- In this case null values are converted to NaN and signed or unsigned
integer
- type arrays are promoted to appropriate float type.
+ integer or float, including all bit-widths.
+
+ ``null_to_nan`` is ``False`` by default and will raise an error in case
+ validity bitmask exists. RecordBatches with validity bitmask for any
of the
Review Comment:
```suggestion
``null_to_nan`` is ``False`` by default and this method will raise
an error in case
any nulls are present. RecordBatches with nulls
```
It doesn't matter if the array has a bitmask allocated (they can be all
valid), just whether there are nulls or not in the actual data.
##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
return pyarrow_wrap_array(c_array)
- def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+ def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True,
MemoryPool memory_pool=None):
"""
Convert to a :class:`~pyarrow.Tensor`.
RecordBatches that can be converted have fields of type signed or
unsigned
- integer or float, including all bit-widths. RecordBatches with
validity bitmask
- for any of the arrays can be converted with ``null_to_nan``turned to
``True``.
- In this case null values are converted to NaN and signed or unsigned
integer
- type arrays are promoted to appropriate float type.
+ integer or float, including all bit-widths.
+
+ ``null_to_nan`` is ``False`` by default and will raise an error in case
+ validity bitmask exists. RecordBatches with validity bitmask for any
of the
+ arrays can be converted with ``null_to_nan`` turned to ``True``. In
this case
+ null values are converted to ``NaN`` and signed or unsigned integer
type arrays
+ are promoted to appropriate float type.
Review Comment:
```suggestion
null values are converted to ``NaN`` and integer type arrays
are promoted to the appropriate float type.
```
##########
cpp/src/arrow/record_batch_test.cc:
##########
@@ -756,7 +763,8 @@ TEST_F(TestRecordBatch, ToTensorSupportedNullToNan) {
auto batch = RecordBatch::Make(schema, length, {a0, a1});
- ASSERT_OK_AND_ASSIGN(auto tensor, batch->ToTensor(/*null_to_nan=*/true));
+ ASSERT_OK_AND_ASSIGN(auto tensor,
+ batch->ToTensor(/*null_to_nan=*/true,
/*row_major=*/false));
Review Comment:
We should probably also test the `null_to_nan` option for row major as well?
Would it be possible to test both inside this function? Only when creating
the expected tensor, the values need to be passed in a different order?
##########
python/pyarrow/table.pxi:
##########
@@ -3389,20 +3389,25 @@ cdef class RecordBatch(_Tabular):
<CResult[shared_ptr[CArray]]>deref(c_record_batch).ToStructArray())
return pyarrow_wrap_array(c_array)
- def to_tensor(self, c_bool null_to_nan=False, MemoryPool memory_pool=None):
+ def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True,
MemoryPool memory_pool=None):
"""
Convert to a :class:`~pyarrow.Tensor`.
RecordBatches that can be converted have fields of type signed or
unsigned
- integer or float, including all bit-widths. RecordBatches with
validity bitmask
- for any of the arrays can be converted with ``null_to_nan``turned to
``True``.
- In this case null values are converted to NaN and signed or unsigned
integer
- type arrays are promoted to appropriate float type.
+ integer or float, including all bit-widths.
+
+ ``null_to_nan`` is ``False`` by default and will raise an error in case
+ validity bitmask exists. RecordBatches with validity bitmask for any
of the
+ arrays can be converted with ``null_to_nan`` turned to ``True``. In
this case
Review Comment:
```suggestion
arrays can be converted with ``null_to_nan`` set to ``True``. In
this case
```
##########
python/pyarrow/tests/test_table.py:
##########
@@ -926,36 +926,46 @@ def test_recordbatch_to_tensor_uniform_type(typ):
pa.array(arr3, type=pa.from_numpy_dtype(typ)),
], ["a", "b", "c"]
)
- result = batch.to_tensor()
+ result = batch.to_tensor(row_major=False)
x = np.array([arr1, arr2, arr3], typ).transpose()
expected = pa.Tensor.from_numpy(x)
+ check_tensors(result, expected, pa.from_numpy_dtype(typ), 27)
+ result = batch.to_tensor()
+ x = np.array([arr1, arr2, arr3], typ, order='F').transpose()
+ expected = pa.Tensor.from_numpy(x)
Review Comment:
```suggestion
result = batch.to_tensor(row_major=False)
x = np.column_stack([arr1, arr2, arr3]).astype(typ, order=F")
expected = pa.Tensor.from_numpy(x)
check_tensors(result, expected, pa.from_numpy_dtype(typ), 27)
result = batch.to_tensor()
x = np.column_stack([arr1, arr2, arr3]).astype(typ, order=C")
expected = pa.Tensor.from_numpy(x)
```
This might be a clearer way to create the expected numpy arrays? Then you
don't need to think about how the `np.array(..)` constructor will create the
array (stack as rows), and then we don't have the `order='F').transpose()` to
create row major data.
(and same for some occurrences below)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]