Hello William,
It seems like you have run into the problem Felipe mentioned earlier. My
response to that was:
the parquet::ByteArray instances don't own the data they point to, so their
internal pointer can become invalid on the next call to ReadBatchSpaced. This
should actually make no difference whether you use that intermediateBuffer or
not, so the second code snippet might also fail.
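One way to make the values survive multiple calls is to copy the bytes into
owned storage (e.g. std::string) right after each ReadBatchSpaced call. A
rough, untested sketch (the helper name CopyByteArrays and the
std::vector<std::string> output are just placeholders, not taken from your
code):

    #include <parquet/types.h>
    #include <string>
    #include <vector>

    // Copies the ByteArray contents of one batch into owned strings,
    // skipping the entries that the validity bitmap marks as null.
    void CopyByteArrays(const parquet::ByteArray* values,
                        const uint8_t* valid_bits, int64_t valid_bits_offset,
                        int64_t levels_read, std::vector<std::string>* out) {
      for (int64_t i = 0; i < levels_read; ++i) {
        const int64_t bit = valid_bits_offset + i;
        if (valid_bits[bit / 8] & (1 << (bit % 8))) {
          // The ByteArray only points into the reader's scratch buffer, so
          // the bytes must be copied before the next ReadBatchSpaced call.
          out->emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                            values[i].len);
        } else {
          out->emplace_back();  // placeholder for a null value
        }
      }
    }

If you call something like this once per loop iteration, the string data stays
valid independently of the reader.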
In general, I can recommend looking at the parquet_arrow implementation to see
how to read files in parquet-cpp: src/parquet/arrow/reader*. Depending on your
use case, it might also be simpler to use this API directly, as the Arrow data
structures are much easier to consume and hide some of the implementation
details of the Parquet format.
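For illustration only (the exact API differs between versions, and the file
name is just an example, so please check the headers of your release), reading
a whole file into an Arrow table looks roughly like this:

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_THROW_NOT_OK(
        arrow::io::ReadableFile::Open("example.parquet", &infile));

    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(
        parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

Nulls are then tracked for you in the validity bitmaps of the Arrow arrays.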
If that does not solve your problem, feel free to ask more ;)
Uwe
On Fri, Nov 3, 2017, at 12:07 AM, William Malpica wrote:
> Hello,
>
> I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
> ByteArrays.
>
> In my use case, I am only reading flat data, but it is data with nulls,
> which is why I am using ReadBatchSpaced: it's the only way I have found to
> read the data and also know which values are null. My code looks something
> like this:
>
> int64_t total_values_read = 0;
>
> int64_t valid_bits_offset = 0;
> int64_t levels_read = 0;
> int64_t values_read = 0;
> int64_t null_count = -1;
>
> std::vector<parquet::ByteArray> values(numRecords);
> std::vector<int16_t> dresult(numRecords, -1);
> std::vector<int16_t> rresult(numRecords, -1);
> std::vector<uint8_t> valid_bits(numRecords, 255);
>
> while (total_values_read < numRecords) {
>   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(numRecords,
>       dresult.data() + total_values_read, rresult.data() + total_values_read,
>       values.data() + total_values_read, valid_bits.data() + total_values_read,
>       valid_bits_offset, &levels_read, &values_read, &null_count);
>
>   total_values_read += rows_read;
> }
>
> When I follow this pattern and need to make multiple calls to
> ReadBatchSpaced, I get garbage results in my vector of ByteArrays after the
> first call in the loop. When I read a more primitive data type, I do not
> have this issue.
> So far, the only way I have been able to get this to work is by using an
> intermediate buffer to hold the ByteArray data, which looks more like this:
>
> std::vector<parquet::ByteArray> values(numRecords);
> std::vector<parquet::ByteArray> intermediateBuffer(numRecords);
> std::vector<int16_t> dresult(numRecords, -1);
> std::vector<int16_t> rresult(numRecords, -1);
> std::vector<uint8_t> valid_bits(numRecords, 255);
>
> while (total_values_read < numRecords) {
>   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(numRecords,
>       dresult.data() + total_values_read, rresult.data() + total_values_read,
>       intermediateBuffer.data(), valid_bits.data() + total_values_read,
>       valid_bits_offset, &levels_read, &values_read, &null_count);
>
>   std::copy(intermediateBuffer.begin(), intermediateBuffer.begin() + rows_read,
>             values.begin() + total_values_read);
>
>   total_values_read += rows_read;
> }
>
>
> Any ideas as to what I am doing incorrectly in my first example? Do I
> always need to use an intermediate buffer?
>
> Thanks!
>
> William
>
>
>
> William Malpica / VP of Engineering
> [email protected] / 859.619.0708
>
> BlazingDB
> www.blazingdb.com
>