jorisvandenbossche commented on issue #43410: URL: https://github.com/apache/arrow/issues/43410#issuecomment-2299154176
Specifically for `pq.write_table()`, this might be a bit trickier (without consuming the stream) because it currently uses `parquet::arrow::FileWriter::WriteTable`, which explicitly requires a table input. The FileWriter interface also supports writing record batches, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still live in something called `write_table`?). See the first sketch below.

But in general, certainly +1 on supporting the interface more widely. Some other possible areas:

- The dataset API for writing. `pyarrow.dataset.write_dataset` already accepts a record batch reader, so this should be straightforward to extend (usage example below).
- Compute functions from `pyarrow.compute`? Those could certainly accept objects with `__arrow_c_array__`, and in theory also `__arrow_c_stream__`, but they would fully consume the stream and return a materialized result, so I'm not sure whether that would be expected. (Although, if you know those functions, that is kind of expected, so maybe this just requires good documentation.)
- Many of the methods on the Array/RecordBatch/Table classes accept similar objects (e.g. `arr.take(..)`). Not sure if we want to make those work with interface objects as well. What we currently accept as input is also a bit inconsistent (only strictly a pyarrow array, or also a numpy array, a list, anything array-like, or any sequence or collection?). If we harmonized that with some helper, we could at the same time easily add support for any arrow-array-like object (see the last sketch below).
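
To illustrate the `write_table` point, here is a minimal sketch of writing a stream batch by batch instead of materializing a full table first. `write_stream_to_parquet` is a hypothetical helper (not an existing pyarrow API), and it assumes a recent pyarrow that has `RecordBatchReader.from_stream` for importing objects exposing `__arrow_c_stream__`:

```python
import pyarrow as pa
import pyarrow.parquet as pq


def write_stream_to_parquet(source, where):
    # Accept a RecordBatchReader or any object exposing __arrow_c_stream__
    # and write it batch by batch, so the stream is never fully materialized
    # in memory the way write_table's table input is.
    reader = pa.RecordBatchReader.from_stream(source)
    with pq.ParquetWriter(where, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)
```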
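
For the dataset point, `write_dataset` already takes a record batch reader today, so the stream case works without going through a table. A small usage example (the schema and batches are just made up for illustration):

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("x", pa.int64())])
batches = (
    pa.RecordBatch.from_pydict({"x": [i, i + 1]}, schema=schema)
    for i in range(3)
)
reader = pa.RecordBatchReader.from_batches(schema, batches)

# The reader is consumed lazily while the dataset files are written.
ds.write_dataset(reader, "example_dataset", format="parquet")
```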
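
And for the last bullet, the kind of helper I have in mind could look roughly like this. `_as_arrow_array` is purely hypothetical (nothing like it is settled), and it assumes that recent `pa.array()` can import objects exposing the Arrow PyCapsule interface:

```python
import pyarrow as pa


def _as_arrow_array(obj):
    # Hypothetical normalization helper that methods like arr.take(...)
    # could funnel their input through, so any arrow-array-like object is
    # accepted consistently.
    if isinstance(obj, pa.Array):
        return obj
    if hasattr(obj, "__arrow_c_array__"):
        # Assumption: pa.array() understands __arrow_c_array__ in recent
        # pyarrow versions and imports the data directly.
        return pa.array(obj)
    # Fall back to the existing conversion paths (numpy array, list, ...).
    return pa.array(obj)
```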
