Hi Alex,

I would suggest handling batch buffering on the application side
rather than calling ReadBatch(1, ...), which will be much slower. The
parquet-cpp APIs are intended for batch reads and writes, so if you
need to read a table row by row, you could create some C++ classes
with a particular batch size that manage an internal buffer of values
that have been read from the column.

As an example, suppose you wish to buffer 1000 values from the column
at a time. Then you could create an API that looks like:

BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
buffered_reader.set_batch_size(1000);

const ByteArray* val;
while ((val = buffered_reader.Next()) != nullptr) {
  // Do something with val
}
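
To make the buffering idea above concrete, here is a minimal sketch of
such a class. It is not part of parquet-cpp: FakeBatchSource is a
hypothetical stand-in for a column reader, its simplified
ReadBatch(n, out) signature is an assumption for this example, and the
extra Source template parameter is only there so the sketch compiles on
its own. A real version would call the typed column reader's ReadBatch
and also carry definition and repetition levels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column reader (not a parquet-cpp class).
// It hands out up to `n` values per call, like a batch read would.
struct FakeBatchSource {
  std::vector<int> data;
  std::size_t pos = 0;

  // Copies up to `n` values into `out` and returns how many were copied.
  int64_t ReadBatch(int64_t n, int* out) {
    int64_t count = 0;
    while (count < n && pos < data.size()) out[count++] = data[pos++];
    return count;
  }
};

// Reads `batch_size` values at a time and exposes them one by one.
template <typename T, typename Source>
class BufferedColumnReader {
 public:
  explicit BufferedColumnReader(Source* source) : source_(source) {}

  void set_batch_size(int64_t batch_size) {
    assert(batch_size > 0);
    batch_size_ = batch_size;
  }

  // Returns a pointer to the next value, or nullptr once the source is
  // exhausted. The pointer is only valid until the next internal refill.
  const T* Next() {
    if (index_ == filled_) {
      buffer_.resize(static_cast<std::size_t>(batch_size_));
      filled_ = source_->ReadBatch(batch_size_, buffer_.data());
      index_ = 0;
      if (filled_ == 0) return nullptr;
    }
    return &buffer_[static_cast<std::size_t>(index_++)];
  }

 private:
  Source* source_;
  int64_t batch_size_ = 1000;  // default used until set_batch_size is called
  std::vector<T> buffer_;
  int64_t filled_ = 0;  // number of valid values currently in buffer_
  int64_t index_ = 0;   // position of the next value to hand out
};
```

With this shape, the per-value cost is an index check and a pointer
return; the underlying batch read only happens once per batch_size
values.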

The ByteArray values do not own their data, so if you wish to persist
the memory between (internal) calls to ReadBatch, you will have to
copy the memory someplace else. We do not perform this copy for you in
the low level ReadBatch API because it would hurt performance for
users who wish to put the memory someplace else (like in an Arrow
columnar array buffer).
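
If you do need the values to outlive the page they came from, the copy
could look roughly like the sketch below. ByteArrayView is a local
stand-in that I am assuming mirrors the len/ptr fields of ByteArray,
and CopyValues is a hypothetical helper name, not a parquet-cpp
function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in with the same two fields as ByteArray (assumption for this
// sketch). It does not own the bytes it points to.
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Copies each value into an owned std::string so it stays valid after
// the buffer behind `ptr` is released or reused.
inline std::vector<std::string> CopyValues(const ByteArrayView* values,
                                           int64_t count) {
  assert(count >= 0);
  std::vector<std::string> owned;
  owned.reserve(static_cast<std::size_t>(count));
  for (int64_t i = 0; i < count; ++i) {
    owned.emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                       values[i].len);
  }
  return owned;
}
```

Doing this once per batch, right after the batch is read, keeps the
copy out of the low-level read path while still giving each row stable
memory.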

I recommend looking at the Apache Arrow-based reader API, which does
all this for you, including memory management.

Thanks
Wes

On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <[email protected]> wrote:
> Hi,
>
> Assume the column type is of 1-Dimension ByteArray array, (definition level
> - 1, and repetition - repeated).
>
>
> If I want to read the column values one row at a time, I have to keep
> reading (i.e. calling ReadBatch(1,...)) until I get a value with
> 'rep_level=0'. At that point, I can assemble the previously read
> ByteArrays and return them as the row.
>
> However, 'ByteArray->ptr' points to the column page memory, which
> (based on my understanding) will be gone once I call 'HasNext()' and
> move to the next page. That means I have to maintain a copy of the
> 'ByteArray->ptr' data for all the previously read values.
>
> This really seems to me to be too complicated..
> Would like to ask if there is a better way of doing:
>    1. Reading 1D array in row-by-row fashion.
>    2. Zero-copy 'ByteArray->ptr'
>
> Thanks a lot,
> --
> Alex Wang,
> Open vSwitch developer
