jorisvandenbossche commented on code in PR #41870:
URL: https://github.com/apache/arrow/pull/41870#discussion_r1629112207
##########
cpp/src/arrow/tensor.cc:
##########
@@ -269,13 +270,15 @@ struct ConvertColumnsToTensorRowMajorVisitor {
     auto in_values = ArraySpan(in_data).GetSpan<In>(1, in_data.length);
     if (in_data.null_count == 0) {
-      for (int64_t i = 0; i < in_data.length; ++i) {
-        out_values[i * num_cols + col_idx] = static_cast<Out>(in_values[i]);
+      for (int64_t data_idx = 0; data_idx < in_data.length; ++data_idx) {
Review Comment:
It might be fine to keep `i` here as the loop variable: it is a short loop, it
is clear where the value comes from (i.e. iterating over the length of
`in_data`), and it keeps the line a bit shorter.
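As a sketch, that would read something like the following (the loop body here is copied from the removed lines above; the body in the final patch may differ):

```cpp
// Sketch only: keep the short loop variable, since the bound already makes it
// clear that `i` iterates over the length of `in_data`.
for (int64_t i = 0; i < in_data.length; ++i) {
  out_values[i * num_cols + col_idx] = static_cast<Out>(in_values[i]);
}
```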
##########
cpp/src/arrow/tensor.cc:
##########
@@ -285,74 +288,78 @@ struct ConvertColumnsToTensorRowMajorVisitor {
 };
 template <typename DataType>
-inline void ConvertColumnsToTensor(const RecordBatch& batch, uint8_t* out,
-                                   bool row_major) {
+inline void ConvertColumnsToTensor(const Table& table, uint8_t* out, bool row_major) {
   using CType = typename arrow::TypeTraits<DataType>::CType;
   auto* out_values = reinterpret_cast<CType*>(out);
   int i = 0;
-  for (const auto& column : batch.columns()) {
-    if (row_major) {
-      ConvertColumnsToTensorRowMajorVisitor<CType> visitor{out_values, *column->data(),
-                                                           batch.num_columns(), i++};
-      DCHECK_OK(VisitTypeInline(*column->type(), &visitor));
-    } else {
-      ConvertColumnsToTensorVisitor<CType> visitor{out_values, *column->data()};
-      DCHECK_OK(VisitTypeInline(*column->type(), &visitor));
+  for (const auto& column : table.columns()) {
+    int j = 0;
+    for (const auto& chunk : column->chunks()) {
+      if (row_major) {
+        ConvertArrayToTensorRowMajorVisitor<CType> visitor{out_values, *chunk->data(),
+                                                           table.num_columns(), i, j};
Review Comment:
While we are here, we could maybe use the more descriptive `col_idx` and
`chunk_idx` (the names that are also used in
`ConvertArrayToTensorRowMajorVisitor`).
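Concretely, something like this (a sketch: the visitor construction is taken from the hunk above, while the placement of the index increments and the rest of the loop body are assumed here):

```cpp
int col_idx = 0;
for (const auto& column : table.columns()) {
  int chunk_idx = 0;
  for (const auto& chunk : column->chunks()) {
    if (row_major) {
      // The index names now match the members of ConvertArrayToTensorRowMajorVisitor.
      ConvertArrayToTensorRowMajorVisitor<CType> visitor{
          out_values, *chunk->data(), table.num_columns(), col_idx, chunk_idx};
      DCHECK_OK(VisitTypeInline(*chunk->type(), &visitor));
    }
    // ... column-major branch as in the PR ...
    ++chunk_idx;
  }
  ++col_idx;
}
```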
##########
python/pyarrow/table.pxi:
##########
@@ -4873,6 +4873,85 @@ cdef class Table(_Tabular):
         return result
+    def to_tensor(self, c_bool null_to_nan=False, c_bool row_major=True,
+                  MemoryPool memory_pool=None):
+        """
+        Convert to a :class:`~pyarrow.Tensor`.
+
+        Tables that can be converted have fields of type signed or unsigned integer or float,
+        including all bit-widths.
+
+        ``null_to_nan`` is ``False`` by default and this method will raise an error in case
+        any nulls are present. Tables with nulls can be converted with ``null_to_nan`` set to
+        ``True``. In this case null values are converted to ``NaN`` and integer type arrays are
+        promoted to the appropriate float type.
+
+        Parameters
+        ----------
+        null_to_nan : bool, default False
+            Whether to write null values in the result as ``NaN``.
+        row_major : bool, default True
+            Whether resulting Tensor is row-major or column-major
+        memory_pool : MemoryPool, default None
+            For memory allocations, if required, otherwise use default pool
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> table = pa.table(
+        ...     [
+        ...         pa.chunked_array([[1, 2], [3, 4, None]], type=pa.int32()),
+        ...         pa.chunked_array([[10, 20, 30], [40, None]], type=pa.float32()),
+        ...     ], names = ["a", "b"]
+        ... )
+
+        >>> table
+        pyarrow.Table
+        a: int32
+        b: float
+        ----
+        a: [[1,2],[3,4,null]]
+        b: [[10,20,30],[40,null]]
+
+        Convert a Table to row-major Tensor with null values written as ``NaN``s:
+
+        >>> table.to_tensor(null_to_nan=True)
+        <pyarrow.Tensor>
+        type: double
+        shape: (5, 2)
+        strides: (16, 8)
+        >>> table.to_tensor(null_to_nan=True).to_numpy()
+        array([[ 1., 10.],
+               [ 2., 20.],
+               [ 3., 30.],
+               [ 4., 40.],
+               [nan, nan]])
+
+        Convert a Table to column-major Tensor
+
+        >>> table.to_tensor(null_to_nan=True, row_major=False)
+        <pyarrow.Tensor>
+        type: double
+        shape: (5, 2)
+        strides: (8, 40)
+        >>> table.to_tensor(null_to_nan=True, row_major=False).to_numpy()
+        array([[ 1., 10.],
+               [ 2., 20.],
+               [ 3., 30.],
+               [ 4., 40.],
+               [nan, nan]])
+        """
+        cdef:
+            shared_ptr[CTable] c_table
+            shared_ptr[CTensor] c_tensor
+            CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)
+
+        c_table = pyarrow_unwrap_table(self)
Review Comment:
I know you copied this from RecordBatch, but seeing this now: in the class
itself, you can actually use `self.table` or `self.sp_table` instead of calling
`pyarrow_unwrap_table` to get the CTable.
Then you also don't need to declare `c_table`, but I think it should work to
directly do something like `GetResultValue(self.table.ToTensor(...))`
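For illustration, the method body could then look roughly like this (a sketch only; it assumes `CTable.ToTensor` is exposed in `libarrow.pxd` with the same `(null_to_nan, row_major, pool)` arguments the current code passes, and that the `nogil`/`pyarrow_wrap_tensor` pattern from `RecordBatch.to_tensor` carries over):

```cython
cdef:
    shared_ptr[CTensor] c_tensor
    CMemoryPool* pool = maybe_unbox_memory_pool(memory_pool)

with nogil:
    # self.table already points at the CTable, so no unwrapping is needed
    # and the c_table declaration goes away.
    c_tensor = GetResultValue(
        self.table.ToTensor(null_to_nan, row_major, pool))
return pyarrow_wrap_tensor(c_tensor)
```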
##########
python/pyarrow/tests/test_table.py:
##########
@@ -1219,6 +1219,295 @@ def test_recordbatch_to_tensor_unsupported():
         batch.to_tensor()
+@pytest.mark.parametrize('typ', [
Review Comment:
This adds a lot of code to the test file while in practice not really testing
anything new. Could we either parametrize some of the existing RecordBatch
tests to run with both RecordBatch and Table (some other tests in this file are
already parametrized that way, see e.g. `test_table_basics`), or only test the
aspects unique to `Table.to_tensor` (that the method exists and works, and a
case with multiple chunks) and rely on the RecordBatch tests for the rest (type
promotion, null handling, etc.)?
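For the parametrized option, a sketch of what a shared test could look like (the test name is made up here, and the expected values simply mirror the docstring example above):

```python
import numpy as np
import pyarrow as pa
import pytest


@pytest.mark.parametrize('constructor', [pa.table, pa.record_batch])
def test_to_tensor_null_to_nan(constructor):
    # Plain (non-chunked) arrays so the same data works for both
    # RecordBatch and Table.
    data = constructor(
        [
            pa.array([1, 2, 3, 4, None], type=pa.int32()),
            pa.array([10, 20, 30, 40, None], type=pa.float32()),
        ],
        names=['a', 'b'],
    )
    result = data.to_tensor(null_to_nan=True).to_numpy()
    expected = np.array(
        [[1, 10], [2, 20], [3, 30], [4, 40], [np.nan, np.nan]],
        dtype=np.float64)
    # assert_array_equal treats NaNs in matching positions as equal.
    np.testing.assert_array_equal(result, expected)
```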
##########
cpp/src/arrow/table.h:
##########
@@ -101,6 +101,19 @@ class ARROW_EXPORT Table {
   static Result<std::shared_ptr<Table>> FromChunkedStructArray(
       const std::shared_ptr<ChunkedArray>& array);
+  /// \brief Convert table with one data type to Tensor
Review Comment:
This "with one data type" is probably a leftover from the first version (when
only a single data type was supported)? (Same for the RecordBatch doc.)
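For example, the brief could read something like this (wording is only a suggestion, borrowing from the Python docstring above):

```cpp
/// \brief Convert the table to a Tensor
///
/// Supported are signed and unsigned integer and floating point types.
```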