Hello,

I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
ByteArrays.

In my use case, I am only reading flat data, but the data contains nulls,
which is why I am using ReadBatchSpaced: it's the only way I have found to
read the data and also know which values are null. My code looks something
like this:

int64_t total_values_read = 0;

int64_t valid_bits_offset = 0;
int64_t levels_read = 0;
int64_t values_read = 0;
int64_t null_count = -1;

std::vector<parquet::ByteArray> values(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords,
      dresult.data() + total_values_read,
      rresult.data() + total_values_read,
      values.data() + total_values_read,
      valid_bits.data() + total_values_read,
      valid_bits_offset, &levels_read, &values_read, &null_count);

  total_values_read += rows_read;
}

When I follow this pattern and need more than one call to ReadBatchSpaced,
I can get garbage results in my vector of ByteArrays after the first call
in the loop. When I read a primitive data type, I do not have this issue.
So far, the only way I have been able to get this to work is by using an
intermediate buffer to hold the ByteArray data. I also cannot reuse that
buffer across iterations, otherwise I get bad data again. That looks
something like this:

std::vector<parquet::ByteArray> values(numRecords);
std::vector<int16_t> dresult(numRecords, -1);
std::vector<int16_t> rresult(numRecords, -1);
std::vector<uint8_t> valid_bits(numRecords, 255);

while (total_values_read < numRecords) {
  // A fresh buffer on every iteration; reusing one gives bad data.
  std::vector<parquet::ByteArray> intermediateBuffer(numRecords);

  int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
      numRecords,
      dresult.data() + total_values_read,
      rresult.data() + total_values_read,
      intermediateBuffer.data(),
      valid_bits.data() + total_values_read,
      valid_bits_offset, &levels_read, &values_read, &null_count);

  std::copy(intermediateBuffer.begin(),
            intermediateBuffer.begin() + rows_read,
            values.begin() + total_values_read);

  total_values_read += rows_read;
}
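
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: parquet::ByteArray holds only a length and a pointer, so copying a
ByteArray copies the pointer and not the bytes behind it. If those bytes live
in memory the reader manages, a later ReadBatchSpaced call could invalidate
them. The sketch below copies the bytes into owned std::string values right
after each batch. ByteArrayView and AppendOwnedCopies are hypothetical names,
and the struct is a stand-in, not the real Parquet type. The sketch also
assumes valid_bits is a bitmap with one bit per slot, least significant bit
first.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for parquet::ByteArray: a length plus a non-owning
// pointer into memory that is managed elsewhere.
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Deep-copies one batch of views into owned strings so the results stay
// valid after the source memory is reused. Slots whose validity bit is 0
// (nulls) and zero-length values are stored as empty strings.
// Assumption: valid_bits is a bitmap, one bit per slot, LSB first.
void AppendOwnedCopies(const ByteArrayView* batch, int64_t count,
                       const uint8_t* valid_bits, int64_t valid_bits_offset,
                       std::vector<std::string>* out) {
  for (int64_t i = 0; i < count; ++i) {
    const int64_t bit = valid_bits_offset + i;
    const bool is_valid = ((valid_bits[bit / 8] >> (bit % 8)) & 1) != 0;
    if (!is_valid || batch[i].len == 0) {
      out->emplace_back();
      continue;
    }
    assert(batch[i].ptr != nullptr);
    out->emplace_back(reinterpret_cast<const char*>(batch[i].ptr),
                      batch[i].len);
  }
}
```

Calling a helper like this once per loop iteration, before the next read,
would mean the collected strings no longer depend on how long the reader
keeps its internal buffers alive. A caller that must tell nulls apart from
empty values would still need to keep the validity bits alongside the strings.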


Any ideas as to what I am doing incorrectly in my first example? Do I
always need to use an intermediate buffer?

Thanks!

William


P.S. A team member of mine tried to send this email to the mailing list, but
it does not seem to have gone through. Is there something someone has to do
before they can post to the mailing list?
