Thanks Wes,

With that in mind, I'm searching for a public API that returns the MAX length value for ByteArray columns. Can you point me to an example?
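[Editor's note: the thread does not identify such an API, and Parquet column statistics record min/max *values* (plus null/distinct counts) rather than value lengths, so one workaround is to track the maximum length while scanning the descriptors ReadBatch returns. A hedged sketch; the `ByteArray` struct below is a stand-in mirroring the len/ptr layout of `parquet::ByteArray`, not the library's actual header:]

```cpp
#include <algorithm>
#include <cstdint>

// Stand-in mirroring parquet::ByteArray's (len, ptr) layout; an
// assumption for illustration, not the library's actual definition.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Scan the descriptors filled in by a ReadBatch() call and return the
// largest value length seen (0 when no values were read).
uint32_t MaxByteArrayLen(const ByteArray* values, int64_t values_read) {
  uint32_t max_len = 0;
  for (int64_t i = 0; i < values_read; ++i) {
    max_len = std::max(max_len, values[i].len);
  }
  return max_len;
}
```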
-Brian

On 9/12/19, 5:34 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

    The memory references returned by ReadBatch are not guaranteed to persist from one function call to the next, so you need to copy the ByteArray data into your own data structures before calling ReadBatch again.

    Column readers for different columns are independent of each other, so function calls for column 7 should not affect anything having to do with column 4.

    On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman <brian.bow...@sas.com> wrote:
    >
    > All,
    >
    > I'm debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
    >
    > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.
    >
    > In the following ReadBatch(), rowsToRead is already set to all rows in the row group. The quantity is verified by the return value in values_read.
    >
    > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
    >
    > Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string, although not into the original dictionary values in the .parquet file being read.
    >
    > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.
    >
    > Is this expected behavior or a bug? If expected, then it seems the dictionary values for column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched BEFORE invoking ReadBatch() again. Is this the case?
    >
    > Thanks for clarifying,
    >
    > -Brian
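[Editor's note: the copy-before-the-next-ReadBatch pattern Wes describes can be sketched as follows. The `ByteArray` struct is a stand-in mirroring the len/ptr layout of `parquet::ByteArray`, not the library's actual header:]

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in mirroring parquet::ByteArray's (len, ptr) layout; ptr targets
// decoder-owned memory that a subsequent ReadBatch() may invalidate.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Deep-copy each descriptor's bytes into caller-owned std::string storage
// so the values survive later ReadBatch() calls on any column.
std::vector<std::string> CopyByteArrays(const ByteArray* values,
                                        int64_t values_read) {
  std::vector<std::string> out;
  out.reserve(static_cast<size_t>(values_read));
  for (int64_t i = 0; i < values_read; ++i) {
    out.emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                     values[i].len);
  }
  return out;
}
```

With owned copies in hand, there is no need to back-patch the descriptor vector; the descriptors can simply be discarded once copied.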