Thx for making my day !~ ;D
On 29 December 2017 at 14:47, Wes McKinney <[email protected]> wrote:
> Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
> precisely this:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88
>
> These do what the API I described in pseudocode does -- didn't
> remember it soon enough for my e-mail.
>
> On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <[email protected]> wrote:
> > Hi Wes,
> >
> > Thanks a lot for your reply, I'll try something as you suggested,
> >
> > Thanks,
> > Alex Wang,
> >
> > On 29 December 2017 at 11:40, Wes McKinney <[email protected]> wrote:
> >
> >> hi Alex,
> >>
> >> I would suggest that you handle batch buffering on the application
> >> side, _not_ calling ReadBatch(1, ...), which will be much slower -- the
> >> parquet-cpp APIs are intended to be used for batch reads and writes, so
> >> if you need to read a table row by row, you could create some C++
> >> classes with a particular batch size that manage an internal buffer of
> >> values that have been read from the column.
> >>
> >> As an example, suppose you wish to buffer 1000 values from the column
> >> at a time. Then you could create an API that looks like:
> >>
> >> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
> >> buffered_reader.set_batch_size(1000);
> >>
> >> const ByteArray* val;
> >> while (val = buffered_reader.Next()) {
> >>   // Do something with val
> >> }
> >>
> >> The ByteArray values do not own their data, so if you wish to persist
> >> the memory between (internal) calls to ReadBatch, you will have to
> >> copy the memory someplace else. We do not perform this copy for you in
> >> the low-level ReadBatch API because it would hurt performance for
> >> users who wish to put the memory someplace else (like in an Arrow
> >> columnar array buffer).
> >>
> >> I recommend looking at the Apache Arrow-based reader API, which does
> >> all this for you, including memory management.
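[For readers of the archive: the buffering pattern Wes sketches above can be written out as a small self-contained class. This is a hypothetical illustration, not the real parquet-cpp API -- the `BatchSource` callback below stands in for `TypedColumnReader::ReadBatch`, whose actual signature also returns definition and repetition levels.]

```cpp
// Sketch of an application-side buffered reader: wrap a batch-oriented
// source behind a value-at-a-time Next() API, refilling an internal
// buffer of `batch_size` values whenever it runs dry.
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

template <typename T>
class BufferedColumnReader {
 public:
  // source(out, max) fills `out` with up to `max` values and returns the
  // number actually read (0 at end of column), mimicking ReadBatch().
  using BatchSource = std::function<int64_t(T* out, int64_t max)>;

  explicit BufferedColumnReader(BatchSource source, int64_t batch_size = 1000)
      : source_(std::move(source)), buffer_(batch_size) {}

  // Returns a pointer to the next value, or nullptr when the column is
  // exhausted. The pointer is only valid until the next internal refill,
  // matching the non-owning semantics Wes describes for ByteArray.
  const T* Next() {
    if (pos_ == size_) {
      size_ = source_(buffer_.data(), static_cast<int64_t>(buffer_.size()));
      pos_ = 0;
      if (size_ == 0) return nullptr;
    }
    return &buffer_[pos_++];
  }

 private:
  BatchSource source_;
  std::vector<T> buffer_;
  int64_t pos_ = 0;   // next unread slot in buffer_
  int64_t size_ = 0;  // valid values currently in buffer_
};
```

The caller loops `while (const T* val = reader.Next())` exactly as in Wes's pseudocode; only the refill ever touches the underlying ReadBatch-style source.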
> >> > >> Thanks > >> Wes > >> > >> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <[email protected]> wrote: > >> > Hi, > >> > > >> > Assume the column type is of 1-Dimension ByteArray array, (definition > >> level > >> > - 1, and repetition - repeated). > >> > > >> > > >> > If I want to read the column values one row at a time, I have to keep > >> read > >> > (i.e. > >> > calling ReadBatch(1,...)) until getting a value of 'rep_level=0'. At > that > >> > point, I can > >> > construct previously read ByteArrays and return it as for the row. > >> > > >> > However, since 'ByteArray->ptr' points to the column page memory which > >> > (based > >> > on my understanding) will be gone when calling 'HasNext()' and move > to > >> the > >> > next > >> > page. So that means i have to maintain a copy of the 'ByteArray->ptr' > >> for > >> > all the > >> > previously read values. > >> > > >> > This really seems to me to be too complicated.. > >> > Would like to ask if there is a better way of doing: > >> > 1. Reading 1D array in row-by-row fashion. > >> > 2. Zero-copy 'ByteArray->ptr' > >> > > >> > Thanks a lot, > >> > -- > >> > Alex Wang, > >> > Open vSwitch developer > >> > > > > > > > > -- > > Alex Wang, > > Open vSwitch developer > -- Alex Wang, Open vSwitch developer
