All, I’m debugging a case with the low-level Parquet reader API where the table has columns of type DOUBLE, BYTE_ARRAY, and FIXED_LEN_BYTE_ARRAY.
Four of the columns (ordinals 3, 4, 7, and 9) are of type BYTE_ARRAY. In the following ReadBatch() call, rowsToRead is already set to the number of rows in the Row Group, and that count is confirmed by the value returned in values_read:

byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);

Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string, although not into the original dictionary values in the .parquet file being read.

As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.

Is this expected behavior or a bug? If it is expected, then it seems the dictionary values for column 4 (or any BYTE_ARRAY column that is dictionary-compressed) should be copied, and the addresses in the descriptor vector back-patched, BEFORE invoking ReadBatch() again. Is this the case?

Thanks for clarifying,
-Brian
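
P.S. To make the question concrete, here is a rough sketch of the copy-and-back-patch step I have in mind, to be run on a column's descriptors before the next ReadBatch() call. It is only an illustration under assumptions: ByteArrayView is a stand-in I made up with the same len/ptr shape as the BYTE_ARRAY descriptors, and DetachFromReaderBuffers is a hypothetical helper of my own, not part of the Parquet API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BYTE_ARRAY descriptor: a length plus a
// non-owning pointer into memory owned by someone else (here, the reader).
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Hypothetical helper: copies the bytes behind every descriptor into one
// contiguous buffer and repoints each descriptor at its private copy.
// `arena` takes ownership of that buffer and must outlive `values`.
inline void DetachFromReaderBuffers(std::vector<ByteArrayView>& values,
                                    std::vector<uint8_t>& arena) {
  std::size_t total = 0;
  for (const ByteArrayView& v : values) total += v.len;

  // Allocate the full size once. Growing the buffer while patching would
  // reallocate it and invalidate the pointers already written.
  std::vector<uint8_t> fresh(total);
  std::size_t offset = 0;
  for (ByteArrayView& v : values) {
    if (v.len > 0) std::memcpy(fresh.data() + offset, v.ptr, v.len);
    v.ptr = fresh.data() + offset;  // back-patch the descriptor
    offset += v.len;
  }
  assert(offset == total);

  // Moving the vector hands over the same heap buffer, so the patched
  // pointers stay valid.
  arena = std::move(fresh);
}
```

The idea would be to keep one such arena per BYTE_ARRAY column, so that reading column 7 cannot disturb the strings already returned for column 4. If the reader is meant to keep those buffers alive itself, this extra copy would of course be unnecessary.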