Hello,

I've fleshed out the ideas in the doc in this draft PR:
https://github.com/apache/arrow/pull/12775

Feedback on the API design is still welcome.

Best,

Will Jones

On Thu, Mar 24, 2022 at 10:25 AM Will Jones <will.jones...@gmail.com> wrote:

> Antoine,
>
> That's a good question. I think there's a critical part that I haven't
> articulated well in the doc yet.
>
> When converting from Arrow's columnar format to Rows, you have three
> options:
>
> (1) Go through the record batch row by row
> (2) Iterate through each column of the record batch, adding each column
> value to its row
> (3) Iterate through smaller sub-batches of the record batch, and do (2) on
> each sub-batch
>
> The converter would do (3). In the cases I've heard of, this seems to be
> the most performant option, though I would welcome others' opinions on
> that. I imagine there are some "memory locality" benefits, though I am no
> expert on that.
>
> This is most apparent when you look at the following two methods:
>
> template <typename T>
> class ToRowConverter {
>  public:
>   // Implemented by the subclass.
>   virtual arrow::Result<std::vector<T>> Convert(
>       std::shared_ptr<arrow::RecordBatch> batch) = 0;
>
>   // Derived from Convert(): converts the batch in slices of batch_size rows.
>   arrow::Result<std::vector<T>> RecordBatchToRows(
>       std::shared_ptr<arrow::RecordBatch> batch, size_t batch_size);
> };
>
> The idea here is that RecordBatchToRows() will convert in smaller slices
> dictated by batch_size. A Record Batch with 2 million rows might be
> converted 10,000 rows at a time.
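
As a rough illustration of option (3), here is a minimal standalone sketch.
It deliberately avoids the Arrow API: ColumnarBatch, Row, ConvertSlice() and
BatchToRows() are hypothetical stand-ins invented for this example, not names
from the doc or the draft PR. With Arrow, the slicing step would presumably
use RecordBatch::Slice(offset, length) rather than index arithmetic.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a two-column record batch (not an Arrow type).
struct ColumnarBatch {
  std::vector<int64_t> ids;
  std::vector<std::string> names;
  std::size_t num_rows() const { return ids.size(); }
};

// Hypothetical row type produced by the conversion.
struct Row {
  int64_t id = 0;
  std::string name;
};

// Option (2): fill the rows one column at a time, restricted to the
// half-open row range [offset, offset + length).
std::vector<Row> ConvertSlice(const ColumnarBatch& batch, std::size_t offset,
                              std::size_t length) {
  std::vector<Row> rows(length);
  for (std::size_t i = 0; i < length; ++i) rows[i].id = batch.ids[offset + i];
  for (std::size_t i = 0; i < length; ++i) rows[i].name = batch.names[offset + i];
  return rows;
}

// Option (3): apply option (2) to slices of at most batch_size rows.
// Assumption for this sketch: batch_size == 0 means "convert in one slice".
std::vector<Row> BatchToRows(const ColumnarBatch& batch, std::size_t batch_size) {
  if (batch_size == 0) batch_size = std::max<std::size_t>(batch.num_rows(), 1);
  std::vector<Row> out;
  out.reserve(batch.num_rows());
  for (std::size_t offset = 0; offset < batch.num_rows(); offset += batch_size) {
    const std::size_t length = std::min(batch_size, batch.num_rows() - offset);
    std::vector<Row> chunk = ConvertSlice(batch, offset, length);
    out.insert(out.end(), std::make_move_iterator(chunk.begin()),
               std::make_move_iterator(chunk.end()));
  }
  return out;
}
```

Each slice touches only batch_size values per column before moving on to the
next column, which is where any locality benefit would come from. Whether
that shows up in practice would need benchmarking.
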
>
> I'm going to update the doc to make that clearer, but does what I
> described above seem sensible?
>
> Best,
> Will Jones
>
>
>
> On Thu, Mar 24, 2022 at 9:47 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hello Will,
>>
>> So the added value would simply be the automatic definition of
>> iterator-returning methods? Or am I missing something?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 23/03/2022 à 19:36, Will Jones a écrit :
>> > Hello Arrow devs,
>> >
>> > I recently created ARROW-16006 [1] ("Helpers for converting between rows
>> > and Arrow objects"), and would appreciate feedback. It's meant for
>> > conversion from arbitrary schemas, whereas the existing C++ examples
>> > demonstrate fixed schemas (that is, known at compile-time).
>> >
>> > If you have implemented conversion between Arrow and row-based data
>> > structures in C++ (or tried to): Would these helpers work for your use
>> > case? There is an associated draft design doc linked in the issue [2],
>> > which is open to comments.
>> >
>> > Thanks,
>> >
>> > Will Jones
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-16006
>> > [2] https://docs.google.com/document/d/174tldmQLMCvOtjxGtFPeoLBefyE1x26_xntwfSzDXFA/edit?usp=sharing
>> >
>>
>
