kylebarron commented on code in PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#discussion_r2504312554
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
Review Comment:
Wording might be slightly better as
```
Although arrow-rs offers Table, a convenience wrapper for `pyarrow.Table`, this is ...
```
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
Review Comment:
I think this should be
```suggestion
        schema: SchemaRef,
```
to better mirror other APIs in arrow-rs
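For illustration, a minimal sketch of what `try_new` might look like under the suggested signature (hedged: it assumes this PR's `Table` fields; with the caller always supplying the schema, the empty-`Vec` special case disappears):
```rs
pub fn try_new(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Result<Self, ArrowError> {
    // Validation is a cheap schema comparison per batch.
    for record_batch in &record_batches {
        if schema != record_batch.schema() {
            return Err(ArrowError::SchemaError(
                "All record batches must have the same schema.".to_owned(),
            ));
        }
    }
    Ok(Self {
        record_batches,
        schema,
    })
}
```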
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
Well, actually, a slight correction: assuming PyCapsule Interface input, both `Table` and `ArrowArrayStreamReader` will work with both table and stream input out of the box; the difference is just whether the Rust code materializes the data.
This is why I have this table in the pyo3-arrow docs:
![Table of accepted input types from the pyo3-arrow docs](https://github.com/user-attachments/assets/38d0e89d-c3bc-4e3b-a5ed-fcb11c9e7a78)
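To make the distinction concrete, a minimal sketch (assuming this PR's `Table` plus the crate's `PyArrowType` wrapper; the function names are hypothetical). Both functions accept a `pyarrow.Table` or a `pyarrow.RecordBatchReader` from Python; only the first materializes all batches up front:
```rs
use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_pyarrow::{PyArrowType, Table}; // `Table` is the type added by this PR
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Eager: every batch is collected into memory at the FFI boundary.
#[pyfunction]
fn total_rows_eager(table: PyArrowType<Table>) -> PyResult<usize> {
    Ok(table.0.record_batches().iter().map(|b| b.num_rows()).sum())
}

/// Lazy: batches are pulled one at a time while iterating the reader.
#[pyfunction]
fn total_rows_lazy(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut rows = 0;
    for batch in reader.0 {
        rows += batch
            .map_err(|e| PyValueError::new_err(e.to_string()))?
            .num_rows();
    }
    Ok(rows)
}
```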
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
+    ) -> Result<Self, ArrowError> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                record_batches
+                    .get(0)
+                    .ok_or_else(|| ArrowError::SchemaError(
+                        "If no schema is supplied explicitly, there must be at least one RecordBatch!".to_owned()
+                    ))?
+                    .schema()
+                    .clone()
+            }
+        };
+        for record_batch in &record_batches {
+            if schema != record_batch.schema() {
+                return Err(ArrowError::SchemaError(
+                    "All record batches must have the same schema.".to_owned(),
+                ));
+            }
+        }
+        Ok(Self {
+            record_batches,
+            schema,
+        })
+    }
+
+    pub fn record_batches(&self) -> &[RecordBatch] {
+        &self.record_batches
+    }
+
+    pub fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    pub fn into_inner(self) -> (Vec<RecordBatch>, SchemaRef) {
+        (self.record_batches, self.schema)
+    }
+}
+
+impl TryFrom<ArrowArrayStreamReader> for Table {
Review Comment:
I think this would be better to implement on `Box<dyn RecordBatchReader>` or `T: RecordBatchReader`.
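A sketch of the boxed variant (hedged: it assumes lib.rs's existing imports and that the impl lives in the same module as `Table`'s private fields). Note a generic `impl<T: RecordBatchReader> TryFrom<T> for Table` would conflict with std's blanket `impl<T, U: Into<T>> TryFrom<U> for T`, so the boxed impl or an inherent `try_from_reader` method is the usual way out:
```rs
impl TryFrom<Box<dyn RecordBatchReader>> for Table {
    type Error = ArrowError;

    fn try_from(reader: Box<dyn RecordBatchReader>) -> Result<Self, ArrowError> {
        // The reader's schema is authoritative even if it yields no batches.
        let schema = reader.schema();
        let record_batches = reader.collect::<Result<Vec<_>, _>>()?;
        Ok(Self {
            record_batches,
            schema,
        })
    }
}
```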
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
Also, reading through the docs again, I'd suggest making a reference to `Box<dyn RecordBatchReader>` rather than `ArrowArrayStreamReader`. The former is a higher-level API and much easier to use.
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
+    ) -> Result<Self, ArrowError> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                record_batches
+                    .get(0)
+                    .ok_or_else(|| ArrowError::SchemaError(
+                        "If no schema is supplied explicitly, there must be at least one RecordBatch!".to_owned()
+                    ))?
+                    .schema()
+                    .clone()
+            }
+        };
+        for record_batch in &record_batches {
+            if schema != record_batch.schema() {
+                return Err(ArrowError::SchemaError(
+                    "All record batches must have the same schema.".to_owned(),
+                ));
+            }
+        }
+        Ok(Self {
+            record_batches,
+            schema,
+        })
+    }
+
+    pub fn record_batches(&self) -> &[RecordBatch] {
+        &self.record_batches
+    }
+
+    pub fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    pub fn into_inner(self) -> (Vec<RecordBatch>, SchemaRef) {
+        (self.record_batches, self.schema)
+    }
+}
+
+impl TryFrom<ArrowArrayStreamReader> for Table {
+    type Error = ArrowError;
+
+    fn try_from(value: ArrowArrayStreamReader) -> Result<Self, ArrowError> {
+        let schema = value.schema();
+        let batches = value.collect::<Result<Vec<_>, _>>()?;
+        // We assume all batches have the same schema here.
+        unsafe { Ok(Self::new_unchecked(batches, schema)) }
+    }
+}
+
+impl FromPyArrow for Table {
+    fn from_pyarrow_bound(ob: &Bound<PyAny>) -> PyResult<Self> {
+        let array_stream_reader: PyResult<ArrowArrayStreamReader> = {
Review Comment:
Can we just do
```rs
let reader: Box<dyn RecordBatchReader> = FromPyArrow::from_pyarrow_bound(ob)?;
Self::try_from(reader)
```
We could have a discussion about what versions of pyarrow we need to support, but given that every pyarrow version of the last two years supports the PyCapsule Interface, I don't think we need to check for `to_reader` as well.
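Note the one-liner above presumes a `FromPyArrow` impl for `Box<dyn RecordBatchReader>`; per the crate docs quoted earlier, import currently goes through `ArrowArrayStreamReader`, so a fuller sketch (hedged: it assumes lib.rs's existing imports, `pyo3::exceptions::PyValueError`, and the boxed `TryFrom` impl suggested in the previous comment) might look like:
```rs
impl FromPyArrow for Table {
    fn from_pyarrow_bound(ob: &Bound<PyAny>) -> PyResult<Self> {
        // Works for any object implementing the Arrow PyCapsule stream
        // interface: pyarrow.Table, pyarrow.RecordBatchReader, etc.
        let reader = ArrowArrayStreamReader::from_pyarrow_bound(ob)?;
        // Box the concrete reader to reuse the suggested
        // `TryFrom<Box<dyn RecordBatchReader>>` impl.
        Self::try_from(Box::new(reader) as Box<dyn RecordBatchReader>)
            .map_err(|e| PyValueError::new_err(e.to_string()))
    }
}
```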
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
Review Comment:
I'd prefer removing `new_unchecked` and only having `try_new`. Do you have a need for `new_unchecked`? The validation to check schemas should be pretty minimal.
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
Review Comment:
```suggestion
//! bulk as a `pyarrow.Table`.
//!
//! In general, it is recommended to use streaming approaches instead of
```
(If this is meant to be a bulleted list, that might have to be indented)
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
I think it would be good to note here that another advantage of using `ArrowArrayStreamReader` is that it works with tables _and_ stream input out of the box. It doesn't matter which type the user passes in.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]