Hello Uwe,

Yes, this was the same issue Felipe posted; he posted it for me since I did not have access to the mailing list. Thanks for the info, it explains the patterns I have been seeing.
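In case it helps anyone else who hits this: what ended up working for me is copying each ByteArray payload into storage I own immediately after every ReadBatchSpaced call, before the next call can invalidate the internal pointers. A rough sketch of the helper I use (CopyOutBatch is my own code, not part of parquet-cpp, so treat it as illustrative):

#include <cstdint>
#include <string>
#include <vector>

#include <parquet/api/reader.h>  // parquet::ByteArray

// Deep-copy each ByteArray payload into owned std::string storage.
// ByteArray::ptr points into the reader's internal buffer, which may be
// reused by the next ReadBatchSpaced call, so the copy has to happen
// before that call. Null slots (cleared bits in valid_bits) get an empty
// placeholder string. The bitmap is LSB-first, one bit per level.
void CopyOutBatch(const parquet::ByteArray* batch, int64_t num_levels,
                  const uint8_t* valid_bits, int64_t valid_bits_offset,
                  std::vector<std::string>* out) {
  for (int64_t i = 0; i < num_levels; ++i) {
    const int64_t bit = valid_bits_offset + i;
    const bool is_valid = (valid_bits[bit / 8] >> (bit % 8)) & 1;
    if (is_valid) {
      out->emplace_back(reinterpret_cast<const char*>(batch[i].ptr),
                        batch[i].len);
    } else {
      out->emplace_back();  // placeholder for a null value
    }
  }
}

Calling this right after each ReadBatchSpaced, with the levels_read it reported, keeps the loop correct even when the reader reuses its buffer.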
Thanks,
William

On 2017-11-07 07:14, "Uwe L. Korn" <[email protected]> wrote:
> Hello William,
>
> Seems like you hit the problem Felipe mentioned earlier. My response to
> that was:
>
> The parquet::ByteArray instances don't own the data, so their internal
> pointer might become invalid on the next call to ReadBatchSpaced. This
> should actually make no difference whether you use that
> intermediateBuffer or not, so the second code snippet might also fail.
> In general, I recommend looking at the parquet_arrow implementation for
> how to read files in parquet-cpp: src/parquet/arrow/reader*. Depending
> on your use case, it might also be simpler for you to use this API, as
> Arrow data structures are much simpler to consume and hide some of the
> implementation details of the Parquet format.
>
> If that does not solve your problem, feel free to ask more ;)
>
> Uwe
>
> On Fri, Nov 3, 2017, at 12:07 AM, William Malpica wrote:
> > Hello,
> >
> > I am trying to use TypedColumnReader<DType>::ReadBatchSpaced to read
> > ByteArrays.
> >
> > In my use case, I am only reading flat data, but it is data with
> > nulls, which is why I am using ReadBatchSpaced: it is the only way I
> > have found to read the data and also know which values are null. My
> > code looks something like this:
> >
> > int64_t total_values_read = 0;
> > int64_t valid_bits_offset = 0;
> > int64_t levels_read = 0;
> > int64_t values_read = 0;
> > int64_t null_count = -1;
> >
> > std::vector<parquet::ByteArray> values(numRecords);
> > std::vector<int16_t> dresult(numRecords, -1);
> > std::vector<int16_t> rresult(numRecords, -1);
> > std::vector<uint8_t> valid_bits(numRecords, 255);
> >
> > while (total_values_read < numRecords) {
> >   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
> >       numRecords - total_values_read,
> >       dresult.data() + total_values_read,
> >       rresult.data() + total_values_read,
> >       values.data() + total_values_read,
> >       valid_bits.data(), valid_bits_offset,
> >       &levels_read, &values_read, &null_count);
> >
> >   total_values_read += rows_read;
> >   valid_bits_offset = total_values_read;
> > }
> >
> > When I follow this pattern and need to make multiple calls to
> > ReadBatchSpaced, I get garbage results in my vector of ByteArrays
> > after the first call in the loop. When I read a more primitive data
> > type, I do not have this issue.
> >
> > So far the only way I have been able to get this to work is by using
> > an intermediary buffer to hold the ByteArray data, which looks more
> > like this:
> >
> > std::vector<parquet::ByteArray> values(numRecords);
> > std::vector<parquet::ByteArray> intermediateBuffer(numRecords);
> > std::vector<int16_t> dresult(numRecords, -1);
> > std::vector<int16_t> rresult(numRecords, -1);
> > std::vector<uint8_t> valid_bits(numRecords, 255);
> >
> > while (total_values_read < numRecords) {
> >   int64_t rows_read = parquetTypeReader->ReadBatchSpaced(
> >       numRecords - total_values_read,
> >       dresult.data() + total_values_read,
> >       rresult.data() + total_values_read,
> >       intermediateBuffer.data(),
> >       valid_bits.data(), valid_bits_offset,
> >       &levels_read, &values_read, &null_count);
> >
> >   std::copy(intermediateBuffer.begin(),
> >             intermediateBuffer.begin() + rows_read,
> >             values.begin() + total_values_read);
> >
> >   total_values_read += rows_read;
> >   valid_bits_offset = total_values_read;
> > }
> >
> > Any ideas as to what I am doing incorrectly in my first example? Do I
> > always need to use an intermediate buffer?
> >
> > Thanks!
> >
> > William
> >
> > William Malpica / VP of Engineering
> > [email protected] / 859.619.0708
> > BlazingDB / www.blazingdb.com
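P.S. For anyone reading this in the archives: below is roughly what the parquet_arrow route Uwe describes looks like, based on my reading of src/parquet/arrow/reader.h. The exact signatures may differ between versions, so treat it as a sketch rather than a drop-in:

#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>  // PARQUET_THROW_NOT_OK

// Read a whole Parquet file into an arrow::Table. Nulls are carried by
// the validity bitmap of each Arrow array, so there is no manual
// ByteArray lifetime management at all.
std::shared_ptr<arrow::Table> ReadParquetFile(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  return table;
}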
