Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
precisely this:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88

These do what the API I described in pseudocode does -- I didn't
remember it soon enough for my earlier e-mail.
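For reference, a usage sketch of the scanner approach. This is hedged and untested against a build: it assumes the `Scanner::Make` / `TypedScanner<ByteArrayType>` interfaces (`HasNext()`, `NextValue()`) as they appear in the linked `column_scanner.h` from late 2017, and the file/column indices are placeholders:

```cpp
#include <memory>
#include <string>

#include "parquet/column_scanner.h"
#include "parquet/file/reader.h"

void ScanByteArrayColumn(const std::string& path) {
  // Open the file and grab the first column of the first row group
  // (indices are illustrative only).
  std::unique_ptr<parquet::ParquetFileReader> file_reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::ColumnReader> col =
      file_reader->RowGroup(0)->Column(0);

  // Scanner::Make wraps the batch-oriented reader; the typed variant
  // then yields one value per call, handling the batching internally.
  std::shared_ptr<parquet::Scanner> scanner =
      parquet::Scanner::Make(col, /*batch_size=*/1000);
  auto* typed = static_cast<parquet::TypedScanner<parquet::ByteArrayType>*>(
      scanner.get());

  parquet::ByteArray value;
  bool is_null = false;
  while (typed->HasNext()) {
    typed->NextValue(&value, &is_null);
    // value.ptr points into the scanner's internal buffer; copy the
    // bytes out if they need to outlive the next call.
  }
}
```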

On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <[email protected]> wrote:
> Hi Wes,
>
> Thanks a lot for your reply, I'll try something as you suggested,
>
> Thanks,
> Alex Wang,
>
> On 29 December 2017 at 11:40, Wes McKinney <[email protected]> wrote:
>
>> hi Alex,
>>
>> I would suggest that you handle batch buffering on the application
>> side rather than calling ReadBatch(1, ...), which will be much
>> slower -- the parquet-cpp APIs are intended for batch reads and
>> writes. If you need to read a table row by row, you could create
>> some C++ classes with a particular batch size that manage an
>> internal buffer of values read from the column.
>>
>> As an example, suppose you wish to buffer 1000 values from the column
>> at a time. Then you could create an API that looks like:
>>
>> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
>> buffered_reader.set_batch_size(1000);
>>
>> const ByteArray* val;
>> while ((val = buffered_reader.Next()) != nullptr) {
>>   // Do something with val
>> }
>>
>> The ByteArray values do not own their data, so if you wish to persist
>> the memory between (internal) calls to ReadBatch, you will have to
>> copy the memory someplace else. We do not perform this copy for you in
>> the low level ReadBatch API because it would hurt performance for
>> users who wish to put the memory someplace else (like in an Arrow
>> columnar array buffer).
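A self-contained sketch of the buffering pattern described above. `StubBatchSource` is a hypothetical stand-in for a real `TypedColumnReader`, and values are copied into the internal buffer, illustrating Wes's point that you must copy if data needs to outlive the underlying batch:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a batch-oriented column reader. A real implementation
// would wrap parquet::TypedColumnReader and its ReadBatch call.
class StubBatchSource {
 public:
  explicit StubBatchSource(std::vector<std::string> values)
      : values_(std::move(values)), pos_(0) {}

  // Mimics ReadBatch: copies up to batch_size values into out,
  // returns the number actually read (0 when exhausted).
  size_t ReadBatch(size_t batch_size, std::string* out) {
    size_t n = 0;
    while (n < batch_size && pos_ < values_.size()) {
      out[n++] = values_[pos_++];
    }
    return n;
  }

 private:
  std::vector<std::string> values_;
  size_t pos_;
};

// Wraps a batch source and hands out one value at a time, refilling
// its buffer in batches behind the scenes.
class BufferedColumnReader {
 public:
  BufferedColumnReader(StubBatchSource* source, size_t batch_size)
      : source_(source),
        buffer_(batch_size),
        batch_size_(batch_size),
        count_(0),
        index_(0) {}

  // Returns a pointer to the next value, or nullptr when exhausted.
  // The pointee is owned by the internal buffer and is only valid
  // until the next refill -- copy it to persist it longer.
  const std::string* Next() {
    if (index_ == count_) {
      count_ = source_->ReadBatch(batch_size_, buffer_.data());
      index_ = 0;
      if (count_ == 0) return nullptr;
    }
    return &buffer_[index_++];
  }

 private:
  StubBatchSource* source_;
  std::vector<std::string> buffer_;
  size_t batch_size_;
  size_t count_;
  size_t index_;
};
```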
>>
>> I recommend looking at the Apache Arrow-based reader API, which
>> does all of this for you, including memory management.
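A hedged sketch of that Arrow-based read path, assuming the `parquet::arrow` API as shipped alongside parquet-cpp in late 2017 (error handling via `arrow::Status` is elided for brevity):

```cpp
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"

std::shared_ptr<arrow::Table> ReadWholeFile(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  arrow::io::ReadableFile::Open(path, &infile);  // Status check elided

  std::unique_ptr<parquet::arrow::FileReader> reader;
  parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader);

  // ReadTable materializes the file into Arrow columnar arrays, which
  // own their memory -- no manual copying of ByteArray pointers.
  std::shared_ptr<arrow::Table> table;
  reader->ReadTable(&table);
  return table;
}
```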
>>
>> Thanks
>> Wes
>>
>> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <[email protected]> wrote:
>> > Hi,
>> >
>> > Assume the column type is a one-dimensional ByteArray array
>> > (definition level = 1, repetition = repeated).
>> >
>> >
>> > If I want to read the column values one row at a time, I have to
>> > keep reading (i.e. calling ReadBatch(1, ...)) until I get a value
>> > with rep_level = 0. At that point, I can assemble the previously
>> > read ByteArrays and return them as the row.
>> >
>> > However, 'ByteArray->ptr' points into the column page memory,
>> > which (based on my understanding) will be gone once 'HasNext()'
>> > moves on to the next page. That means I have to maintain a copy
>> > of 'ByteArray->ptr' for all the previously read values.
>> >
>> > This really seems too complicated to me.
>> > Is there a better way of:
>> >    1. Reading a 1D array row by row?
>> >    2. Using 'ByteArray->ptr' with zero copy?
>> >
>> > Thanks a lot,
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
>
> --
> Alex Wang,
> Open vSwitch developer
