All,

I’m debugging a case with the low-level Parquet reader API in which the table
has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY columns.

Four of the columns (ordinals 3, 4, 7, and 9) are of type BYTE_ARRAY.

In the following ReadBatch() call, rowsToRead is already set to the number of
rows in the Row Group.  That count is confirmed by the value returned in
values_read.

      
byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);

Column 4 is dictionary encoded.  Upon return from its ReadBatch() call, the
result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs
pointing into decoded dictionary strings, although not into the original
dictionary values in the .parquet file being read.

As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7),
a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for
column 4 are trashed.

Is this expected behavior or a bug?  If it is expected, then it seems the
dictionary values for column 4 (or any other dictionary-encoded BYTE_ARRAY
column) should be copied, and the descriptor vector addresses back-patched,
before invoking ReadBatch() again.  Is this the case?
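
To make the question concrete, here is a minimal sketch of the copy and
back-patch step I have in mind.  ByteArrayView is a hypothetical stand-in for
the reader's BYTE_ARRAY descriptor (a length plus a non-owning pointer), and
CopyAndBackPatch is an illustrative helper of my own, not part of the Parquet
API.  It assumes the descriptors are still valid when it is called, i.e. it
runs right after ReadBatch() returns for a column and before the next
ReadBatch() call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a BYTE_ARRAY descriptor: a length and a
// non-owning pointer into memory owned by someone else (e.g. the reader).
struct ByteArrayView {
    uint32_t len;
    const uint8_t* ptr;
};

// Copies every value referenced by `descs` into `arena`, then re-points each
// descriptor at its copy. Afterwards the descriptors depend only on `arena`,
// not on whatever buffer they pointed into before.
// Assumption: the descriptors do not already point into `arena`.
void CopyAndBackPatch(std::vector<ByteArrayView>& descs,
                      std::vector<uint8_t>& arena) {
    // First pass: compute the total size so the arena is allocated once
    // and pointers into it stay stable while we patch.
    std::size_t total = 0;
    for (const ByteArrayView& d : descs) total += d.len;
    arena.resize(total);

    // Second pass: copy the bytes and back-patch each pointer.
    std::size_t offset = 0;
    for (ByteArrayView& d : descs) {
        if (d.len > 0) std::memcpy(arena.data() + offset, d.ptr, d.len);
        d.ptr = arena.data() + offset;
        offset += d.len;
    }
    assert(offset == total);
}
```

The intended use would be one arena per BYTE_ARRAY column, filled immediately
after that column's ReadBatch() returns, and kept alive for as long as the
descriptors are used.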

Thanks for clarifying,


-Brian



