Thx for making my day !~ ;D


On 29 December 2017 at 14:47, Wes McKinney <[email protected]> wrote:

> Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
> precisely this:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_scanner.h#L88
>
> These do what the API I described in pseudocode does -- didn't
> remember it soon enough for my e-mail.
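
The `TypedScanner` usage could look roughly like the sketch below. This is not compiled here; the `batch_size` constructor parameter and the `HasNext()` / `NextValue()` methods are per my reading of `column_scanner.h`, so check the header for the exact signatures:

```cpp
// Sketch, assuming the parquet-cpp TypedScanner API from column_scanner.h.
#include "parquet/column_scanner.h"

void ScanByteArrayColumn(std::shared_ptr<parquet::ColumnReader> reader) {
  parquet::TypedScanner<parquet::ByteArrayType> scanner(reader,
                                                        /*batch_size=*/1000);
  parquet::ByteArray val;
  bool is_null;
  while (scanner.HasNext()) {
    scanner.NextValue(&val, &is_null);
    // Do something with val; copy val.ptr if it must outlive the batch.
  }
}
```
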
>
> On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <[email protected]> wrote:
> > Hi Wes,
> >
> > Thanks a lot for your reply, I'll try something as you suggested,
> >
> > Thanks,
> > Alex Wang,
> >
> > On 29 December 2017 at 11:40, Wes McKinney <[email protected]> wrote:
> >
> >> hi Alex,
> >>
> >> I would suggest that you handle batch buffering on the application
> >> side, _not_ calling ReadBatch(1, ...) which will be much slower -- the
> >> parquet-cpp APIs are intended to be used for batch read and writes, so
> >> if you need to read a table row by row, you could create some C++
> >> classes with a particular batch size that manage an internal buffer of
> >> values that have been read from the column.
> >>
> >> As an example, suppose you wish to buffer 1000 values from the column
> >> at a time. Then you could create an API that looks like:
> >>
> >> BufferedColumnReader<ByteArrayType> buffered_reader(batch_reader);
> >> buffered_reader.set_batch_size(1000);
> >>
> >> const ByteArray* val;
> >> while ((val = buffered_reader.Next()) != nullptr) {
> >>   // Do something with val
> >> }
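
The buffering idea above can be sketched as a self-contained class. The `ByteArray` struct and `MockBatchReader` below are stand-ins invented for illustration (the real `parquet::ByteArray` and column reader live in parquet-cpp); only the buffering pattern is the point:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical non-owning view mirroring parquet::ByteArray.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Mock stand-in for a parquet column reader: hands out values in batches,
// mimicking a ReadBatch(batch_size, ..., values, &values_read) call.
class MockBatchReader {
 public:
  explicit MockBatchReader(std::vector<ByteArray> values)
      : values_(std::move(values)) {}

  // Fills `out` with up to `batch_size` values; returns the number read.
  int64_t ReadBatch(int64_t batch_size, ByteArray* out) {
    int64_t n = 0;
    while (n < batch_size && pos_ < values_.size()) {
      out[n++] = values_[pos_++];
    }
    return n;
  }

 private:
  std::vector<ByteArray> values_;
  size_t pos_ = 0;
};

// Sketch of the BufferedColumnReader from the e-mail: wraps a batch
// reader and exposes a value-at-a-time Next() over an internal buffer.
template <typename Reader>
class BufferedColumnReader {
 public:
  explicit BufferedColumnReader(Reader* reader) : reader_(reader) {}

  void set_batch_size(int64_t batch_size) { batch_size_ = batch_size; }

  // Returns a pointer to the next value, or nullptr when exhausted.
  // The pointer is only valid until the next internal ReadBatch call.
  const ByteArray* Next() {
    if (pos_ == buffered_) {
      buffer_.resize(static_cast<size_t>(batch_size_));
      buffered_ = reader_->ReadBatch(batch_size_, buffer_.data());
      pos_ = 0;
      if (buffered_ == 0) return nullptr;  // Column exhausted.
    }
    return &buffer_[pos_++];
  }

 private:
  Reader* reader_;
  int64_t batch_size_ = 1000;
  std::vector<ByteArray> buffer_;
  int64_t pos_ = 0;
  int64_t buffered_ = 0;
};
```

Each `Next()` call is a cheap pointer bump; `ReadBatch` runs only once per `batch_size` values, which is the point of buffering on the application side.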
> >>
> >> The ByteArray values do not own their data, so if you wish to persist
> >> the memory between (internal) calls to ReadBatch, you will have to
> >> copy the memory someplace else. We do not perform this copy for you in
> >> the low level ReadBatch API because it would hurt performance for
> >> users who wish to put the memory someplace else (like in an Arrow
> >> columnar array buffer)
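
Since a ByteArray is just a (length, pointer) view, copying it out is a one-liner into owned storage. The `ByteArray` struct below is a hypothetical stand-in for `parquet::ByteArray`, used only to make the snippet self-contained:

```cpp
#include <cstdint>
#include <string>

// Hypothetical non-owning view mirroring parquet::ByteArray.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Copy the bytes a ByteArray points at into owned storage, so the value
// survives after the underlying column page buffer is recycled.
std::string ToOwnedString(const ByteArray& val) {
  return std::string(reinterpret_cast<const char*>(val.ptr), val.len);
}
```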
> >>
> >> I recommend looking at the Apache Arrow-based reader API which does
> >> all this for you including memory management.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > Assume the column is a one-dimensional ByteArray array (max
> >> > definition level = 1, repetition type = repeated).
> >> >
> >> >
> >> > If I want to read the column values one row at a time, I have to keep
> >> > reading (i.e. calling ReadBatch(1, ...)) until I get a value with
> >> > 'rep_level=0'. At that point, I can assemble the previously read
> >> > ByteArrays and return them as the row.
> >> >
> >> > However, 'ByteArray->ptr' points into the column page memory, which
> >> > (based on my understanding) is freed when 'HasNext()' advances to the
> >> > next page. That means I have to keep a copy of 'ByteArray->ptr' for
> >> > all the previously read values.
> >> >
> >> > This really seems too complicated to me.
> >> > I would like to ask if there is a better way of:
> >> >    1. Reading a 1D array row by row.
> >> >    2. Using 'ByteArray->ptr' zero-copy.
> >> >
> >> > Thanks a lot,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer
