Also, I think you can use the `Scanner` / `TypedScanner` APIs to do precisely this:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88

These do what the API I described in pseudocode does -- I didn't remember it soon enough for my e-mail.

On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <[email protected]> wrote:
> Hi Wes,
>
> Thanks a lot for your reply, I'll try something as you suggested.
>
> Thanks,
> Alex Wang
>
> On 29 December 2017 at 11:40, Wes McKinney <[email protected]> wrote:
>
>> hi Alex,
>>
>> I would suggest that you handle batch buffering on the application
>> side, _not_ calling ReadBatch(1, ...), which will be much slower -- the
>> parquet-cpp APIs are intended to be used for batch reads and writes. So
>> if you need to read a table row by row, you could create some C++
>> classes with a particular batch size that manage an internal buffer of
>> values that have been read from the column.
>>
>> As an example, suppose you wish to buffer 1000 values from the column
>> at a time. Then you could create an API that looks like:
>>
>> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
>> buffered_reader.set_batch_size(1000);
>>
>> const ByteArray* val;
>> while ((val = buffered_reader.Next())) {
>>   // Do something with val
>> }
>>
>> The ByteArray values do not own their data, so if you wish to persist
>> the memory between (internal) calls to ReadBatch, you will have to
>> copy the memory someplace else. We do not perform this copy for you in
>> the low-level ReadBatch API because it would hurt performance for
>> users who wish to put the memory someplace else (like in an Arrow
>> columnar array buffer).
>>
>> I recommend looking at the Apache Arrow-based reader API, which does
>> all this for you, including memory management.
>>
>> Thanks,
>> Wes
>>
>> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <[email protected]> wrote:
>> > Hi,
>> >
>> > Assume the column type is a 1-dimensional ByteArray array (definition
>> > level = 1, and repetition = repeated).
>> >
>> > If I want to read the column values one row at a time, I have to keep
>> > reading (i.e. calling ReadBatch(1, ...)) until getting a value with
>> > 'rep_level=0'. At that point, I can construct the previously read
>> > ByteArrays and return them as the row.
>> >
>> > However, 'ByteArray->ptr' points to the column page memory, which
>> > (based on my understanding) will be gone when calling 'HasNext()' and
>> > moving to the next page. So that means I have to maintain a copy of
>> > 'ByteArray->ptr' for all the previously read values.
>> >
>> > This really seems too complicated to me.
>> > I would like to ask if there is a better way of doing:
>> > 1. Reading a 1D array in row-by-row fashion.
>> > 2. Zero-copy 'ByteArray->ptr'.
>> >
>> > Thanks a lot,
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>
> --
> Alex Wang,
> Open vSwitch developer
