Hello, I've fleshed out the ideas in the doc in this draft PR: https://github.com/apache/arrow/pull/12775
Feedback on the API design is still welcome.

Best,

Will Jones

On Thu, Mar 24, 2022 at 10:25 AM Will Jones <will.jones...@gmail.com> wrote:

> Antoine,
>
> That's a good question. I think there's a critical part that I haven't
> articulated well in the doc yet.
>
> When converting from Arrow's columnar format to rows, you have three
> options:
>
> (1) Go through the record batch row by row
> (2) Iterate through each column of the record batch, adding each column
>     value to its row
> (3) Iterate through smaller sub-batches of the record batch, and do (2)
>     on each sub-batch
>
> The converter would do (3). In the cases I've heard of, this seems to be
> the most performant option, though I would welcome others' opinions on
> that. I imagine there are some "memory locality" benefits, though I am no
> expert on that.
>
> This is most apparent when you look at the following two methods:
>
> template <typename T>
> class ToRowConverter {
>  public:
>   // Implemented by the subclass.
>   virtual arrow::Result<std::vector<T>>
>   Convert(std::shared_ptr<arrow::RecordBatch> batch) = 0;
>
>   // Derived from Convert().
>   arrow::Result<std::vector<T>>
>   RecordBatchToRows(std::shared_ptr<arrow::RecordBatch> batch,
>                     size_t batch_size);
> };
>
> The idea here is that RecordBatchToRows() will convert in smaller slices
> dictated by batch_size. A record batch with 2 million rows might be
> converted 10,000 rows at a time.
>
> I'm going to update the doc to make that clearer, but does what I
> described above seem sensible?
>
> Best,
> Will Jones
>
> On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <anto...@python.org> wrote:
>
>> Hello Will,
>>
>> So the added value would simply be the automatic definition of
>> iterator-returning methods? Or am I missing something?
>>
>> Regards
>>
>> Antoine.
>>
>> On 23/03/2022 at 19:36, Will Jones wrote:
>> > Hello Arrow devs,
>> >
>> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
>> > and Arrow objects"), and would appreciate feedback. It's meant for
>> > conversion from arbitrary schemas, whereas the existing C++ examples
>> > demonstrate fixed schemas (that is, known at compile-time).
>> >
>> > If you have implemented conversion between Arrow and row-based data
>> > structures in C++ (or tried to): Would these helpers work for your use
>> > case? There is an associated draft design doc linked in the issue [2],
>> > which is open to comments.
>> >
>> > Thanks,
>> >
>> > Will Jones
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-16006
>> > [2] https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
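
For readers following the thread, option (3) from the quoted message (slice the batch, then fill rows column by column within each slice) can be sketched roughly as below. This is only an illustration and not the code in the draft PR: it deliberately avoids Arrow types so that it compiles on its own, and ColumnBatch, Row, ConvertSlice and BatchToRows are all hypothetical names made up for this sketch. ConvertSlice plays the role of Convert() and BatchToRows the role of RecordBatchToRows().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a two-column record batch (NOT an Arrow type).
struct ColumnBatch {
  std::vector<std::int64_t> ids;
  std::vector<std::string> names;
  std::size_t num_rows() const { return ids.size(); }
};

// Hypothetical row type that the columnar data is converted into.
struct Row {
  std::int64_t id = 0;
  std::string name;
};

// Option (2) applied to one slice: walk each column in turn and write its
// values into the rows covering [offset, offset + length).
std::vector<Row> ConvertSlice(const ColumnBatch& batch, std::size_t offset,
                              std::size_t length) {
  std::vector<Row> rows(length);
  for (std::size_t i = 0; i < length; ++i) rows[i].id = batch.ids[offset + i];
  for (std::size_t i = 0; i < length; ++i) rows[i].name = batch.names[offset + i];
  return rows;
}

// Option (3): split the batch into slices of at most batch_size rows and
// convert each slice with ConvertSlice, appending the results in order.
std::vector<Row> BatchToRows(const ColumnBatch& batch, std::size_t batch_size) {
  assert(batch_size > 0);
  std::vector<Row> out;
  out.reserve(batch.num_rows());
  for (std::size_t offset = 0; offset < batch.num_rows(); offset += batch_size) {
    const std::size_t length = std::min(batch_size, batch.num_rows() - offset);
    std::vector<Row> chunk = ConvertSlice(batch, offset, length);
    out.insert(out.end(), std::make_move_iterator(chunk.begin()),
               std::make_move_iterator(chunk.end()));
  }
  return out;
}
```

With these stand-ins, BatchToRows(batch, 10000) on a 2-million-row batch would call ConvertSlice 200 times, each call touching only 10,000 rows per column. Whether that actually helps memory locality in the real converter is the open question raised in the thread, not something this sketch demonstrates.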