kylebarron commented on code in PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#discussion_r2504312554
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
Review Comment:
Wording might be slightly better as
```
Although arrow-rs offers Table, a convenience wrapper for `pyarrow.Table`, this is ...
```
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
Review Comment:
I think this should be
```suggestion
        schema: SchemaRef,
```
to better mirror other APIs in arrow-rs
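For illustration, a minimal sketch of what `try_new` might look like under the suggested signature (hedged: it assumes this PR's `Table` fields; with the caller always supplying the schema, the empty-`Vec` special case disappears):
```rs
pub fn try_new(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Result<Self, ArrowError> {
    // Validation is a cheap schema comparison per batch.
    for record_batch in &record_batches {
        if schema != record_batch.schema() {
            return Err(ArrowError::SchemaError(
                "All record batches must have the same schema.".to_owned(),
            ));
        }
    }
    Ok(Self {
        record_batches,
        schema,
    })
}
```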
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
Well, actually, a slight correction: assuming PyCapsule Interface input, both `Table` and `ArrowArrayStreamReader` will work with both table and stream input out of the box; the difference is just whether the Rust code materializes the data.
This is why I have this table in the pyo3-arrow docs:
![Table of accepted input types from the pyo3-arrow docs](https://github.com/user-attachments/assets/38d0e89d-c3bc-4e3b-a5ed-fcb11c9e7a78)
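To make the distinction concrete, a minimal sketch (assuming this PR's `Table` plus the crate's `PyArrowType` wrapper; the function names are hypothetical). Both functions accept a `pyarrow.Table` or a `pyarrow.RecordBatchReader` from Python; only the first materializes all batches up front:
```rs
use arrow_array::ffi_stream::ArrowArrayStreamReader;
use arrow_pyarrow::{PyArrowType, Table}; // `Table` is the type added by this PR
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Eager: every batch is collected into memory at the FFI boundary.
#[pyfunction]
fn total_rows_eager(table: PyArrowType<Table>) -> PyResult<usize> {
    Ok(table.0.record_batches().iter().map(|b| b.num_rows()).sum())
}

/// Lazy: batches are pulled one at a time while iterating the reader.
#[pyfunction]
fn total_rows_lazy(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut rows = 0;
    for batch in reader.0 {
        rows += batch
            .map_err(|e| PyValueError::new_err(e.to_string()))?
            .num_rows();
    }
    Ok(rows)
}
```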
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
+    ) -> Result<Self, ArrowError> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                record_batches
+                    .get(0)
+                    .ok_or_else(|| ArrowError::SchemaError(
+                        "If no schema is supplied explicitly, there must be at least one RecordBatch!".to_owned()
+                    ))?
+                    .schema()
+                    .clone()
+            }
+        };
+        for record_batch in &record_batches {
+            if schema != record_batch.schema() {
+                return Err(ArrowError::SchemaError(
+                    "All record batches must have the same schema.".to_owned(),
+                ));
+            }
+        }
+        Ok(Self {
+            record_batches,
+            schema,
+        })
+    }
+
+    pub fn record_batches(&self) -> &[RecordBatch] {
+        &self.record_batches
+    }
+
+    pub fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    pub fn into_inner(self) -> (Vec<RecordBatch>, SchemaRef) {
+        (self.record_batches, self.schema)
+    }
+}
+
+impl TryFrom<ArrowArrayStreamReader> for Table {
Review Comment:
I think this would be better to implement on `Box<dyn RecordBatchReader>` or `T: RecordBatchReader`.
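A sketch of the boxed variant (hedged: it assumes lib.rs's existing imports and that the impl lives in the same module as `Table`'s private fields). Note a generic `impl<T: RecordBatchReader> TryFrom<T> for Table` would conflict with std's blanket `impl<T, U: Into<T>> TryFrom<U> for T`, so the boxed impl or an inherent `try_from_reader` method is the usual way out:
```rs
impl TryFrom<Box<dyn RecordBatchReader>> for Table {
    type Error = ArrowError;

    fn try_from(reader: Box<dyn RecordBatchReader>) -> Result<Self, ArrowError> {
        // The reader's schema is authoritative even if it yields no batches.
        let schema = reader.schema();
        let record_batches = reader.collect::<Result<Vec<_>, _>>()?;
        Ok(Self {
            record_batches,
            schema,
        })
    }
}
```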
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
Also, reading through the docs again, I'd suggest making a reference to `Box<dyn RecordBatchReader>` rather than `ArrowArrayStreamReader`. The former is a higher-level API and much easier to use.
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
+
+    pub fn try_new(
+        record_batches: Vec<RecordBatch>,
+        schema: Option<SchemaRef>,
+    ) -> Result<Self, ArrowError> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                record_batches
+                    .get(0)
+                    .ok_or_else(|| ArrowError::SchemaError(
+                        "If no schema is supplied explicitly, there must be at least one RecordBatch!".to_owned()
+                    ))?
+                    .schema()
+                    .clone()
+            }
+        };
+        for record_batch in &record_batches {
+            if schema != record_batch.schema() {
+                return Err(ArrowError::SchemaError(
+                    "All record batches must have the same schema.".to_owned(),
+                ));
+            }
+        }
+        Ok(Self {
+            record_batches,
+            schema,
+        })
+    }
+
+    pub fn record_batches(&self) -> &[RecordBatch] {
+        &self.record_batches
+    }
+
+    pub fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    pub fn into_inner(self) -> (Vec<RecordBatch>, SchemaRef) {
+        (self.record_batches, self.schema)
+    }
+}
+
+impl TryFrom<ArrowArrayStreamReader> for Table {
+    type Error = ArrowError;
+
+    fn try_from(value: ArrowArrayStreamReader) -> Result<Self, ArrowError> {
+        let schema = value.schema();
+        let batches = value.collect::<Result<Vec<_>, _>>()?;
+        // We assume all batches have the same schema here.
+        unsafe { Ok(Self::new_unchecked(batches, schema)) }
+    }
+}
+
+impl FromPyArrow for Table {
+    fn from_pyarrow_bound(ob: &Bound<PyAny>) -> PyResult<Self> {
+        let array_stream_reader: PyResult<ArrowArrayStreamReader> = {
Review Comment:
Can we just do
```rs
let reader: Box<dyn RecordBatchReader> = FromPyArrow::from_pyarrow_bound(ob)?;
Self::try_from(reader)
```
We could have a discussion about what versions of pyarrow we need to support, but given that every pyarrow version of the last two years supports the PyCapsule Interface, I don't think we need to check for `to_reader` as well.
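Note the one-liner above presumes a `FromPyArrow` impl for `Box<dyn RecordBatchReader>`; per the crate docs quoted earlier, import currently goes through `ArrowArrayStreamReader`, so a fuller sketch (hedged: it assumes lib.rs's existing imports, `pyo3::exceptions::PyValueError`, and the boxed `TryFrom` impl suggested in the previous comment) might look like:
```rs
impl FromPyArrow for Table {
    fn from_pyarrow_bound(ob: &Bound<PyAny>) -> PyResult<Self> {
        // Works for any object implementing the Arrow PyCapsule stream
        // interface: pyarrow.Table, pyarrow.RecordBatchReader, etc.
        let reader = ArrowArrayStreamReader::from_pyarrow_bound(ob)?;
        // Box the concrete reader to reuse the suggested
        // `TryFrom<Box<dyn RecordBatchReader>>` impl.
        Self::try_from(Box::new(reader) as Box<dyn RecordBatchReader>)
            .map_err(|e| PyValueError::new_err(e.to_string()))
    }
}
```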
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -484,6 +487,137 @@ impl IntoPyArrow for ArrowArrayStreamReader {
    }
}
+/// This is a convenience wrapper around `Vec<RecordBatch>` that tries to simplify conversion from
+/// and to `pyarrow.Table`.
+///
+/// This could be used in circumstances where you either want to consume a `pyarrow.Table` directly
+/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
+/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead) or, more
+/// importantly, where one wants to export a `pyarrow.Table` from a `Vec<RecordBatch>` from the Rust
+/// side.
+///
+/// ```ignore
+/// #[pyfunction]
+/// fn return_table(...) -> PyResult<PyArrowType<Table>> {
+///     let batches: Vec<RecordBatch>;
+///     let schema: SchemaRef;
+///     PyArrowType(Table::try_new(batches, schema).map_err(|err| err.into_py_err(py))?)
+/// }
+/// ```
+#[derive(Clone)]
+pub struct Table {
+    record_batches: Vec<RecordBatch>,
+    schema: SchemaRef,
+}
+
+impl Table {
+    pub unsafe fn new_unchecked(record_batches: Vec<RecordBatch>, schema: SchemaRef) -> Self {
+        Self {
+            record_batches,
+            schema,
+        }
+    }
Review Comment:
I'd prefer removing `new_unchecked` and only having `try_new`. Do you have a need for `new_unchecked`? The validation to check schemas should be pretty minimal.
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
Review Comment:
```suggestion
//! bulk as a `pyarrow.Table`.
//!
//! In general, it is recommended to use streaming approaches instead of
```
(If this is meant to be a bulleted list, that might have to be indented)
##########
arrow-pyarrow/src/lib.rs:
##########
@@ -44,17 +44,20 @@
//! | `pyarrow.Array` | [ArrayData] |
//! | `pyarrow.RecordBatch` | [RecordBatch] |
//! | `pyarrow.RecordBatchReader` | [ArrowArrayStreamReader] / `Box<dyn RecordBatchReader + Send>` (1) |
+//! | `pyarrow.Table` | [Table] (2) |
//!
//! (1) `pyarrow.RecordBatchReader` can be imported as [ArrowArrayStreamReader]. Either
//! [ArrowArrayStreamReader] or `Box<dyn RecordBatchReader + Send>` can be exported
//! as `pyarrow.RecordBatchReader`. (`Box<dyn RecordBatchReader + Send>` is typically
//! easier to create.)
//!
-//! PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't
-//! have these same concepts. A chunked table is instead represented with
-//! `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling
-//! [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader)
-//! and then importing the reader as a [ArrowArrayStreamReader].
+//! (2) Although arrow-rs offers a [pyarrow.Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table)
+//! convenience wrapper [Table] (which internally holds `Vec<RecordBatch>`), this is more meant for
+//! use cases where you already have `Vec<RecordBatch>` on the Rust side and want to export that in
+//! bulk as a `pyarrow.Table`. In general, it is recommended to use streaming approaches instead of
+//! dealing with bulk data.
+//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
+//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
Review Comment:
I think it would be good to note here that another advantage of using `ArrowArrayStreamReader` is that it works with tables _and_ stream input out of the box. It doesn't matter which type the user passes in.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]