Hi Luca,

It seems to me that the problem comes from node->type_length(): it should be 0 here, not 64. Could you please check the value of column_index_ in CheckColumn() just before it throws? If you need further assistance, please create an issue on GitHub. It would help to attach a file that reproduces the problem.
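To show why the column index matters, here is a minimal standalone sketch of the length comparison you quoted. It is a simplified model, not the Arrow source: ColumnInfo, CheckLength and LengthMatches are hypothetical names made up for illustration, and it assumes (as in your report) that the int64 read requests a length of 0.

```cpp
// Simplified, standalone model of the length check quoted from stream_reader.cc.
// ColumnInfo, CheckLength and LengthMatches are hypothetical, not Arrow APIs.
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct ColumnInfo {
  std::string name;
  int type_length;  // stand-in for node->type_length()
};

// Mirrors the quoted comparison: the length requested by the reader must
// equal the length recorded in the schema for the current column.
void CheckLength(const std::vector<ColumnInfo>& columns,
                 std::size_t column_index, int length) {
  if (column_index >= columns.size()) {
    throw std::out_of_range("Column index out of bounds: " +
                            std::to_string(column_index));
  }
  const ColumnInfo& node = columns[column_index];
  if (length != node.type_length) {
    throw std::runtime_error("Column length mismatch. Column '" + node.name +
                             "' (index " + std::to_string(column_index) +
                             ") has length " +
                             std::to_string(node.type_length) + ", not " +
                             std::to_string(length));
  }
}

// Returns true if a read requesting `length` passes the check at `column_index`.
bool LengthMatches(const std::vector<ColumnInfo>& columns,
                   std::size_t column_index, int length) {
  try {
    CheckLength(columns, column_index, length);
    return true;
  } catch (const std::exception&) {
    return false;
  }
}
```

In this model, with columns {{"c1", 0}, {"c2", 64}}, a request of length 0 passes at index 0 and fails at index 1. One possible explanation for your exception is therefore that the reader is positioned on a different column than you expect when it throws, which is what the value of column_index_ would tell us.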

Thanks
Gang

On Tue, Jul 4, 2023 at 5:31 AM Luca Jones <kashara...@gmail.com> wrote:

> Hi,
>
> I've been trying to read data from a Parquet file into a stream using the
> Parquet::StreamReader class for a while. The first column of my data
> consists of int64s - thus, I have been streaming data as follows:
>
>     shared_ptr<arrow::io::ReadableFile> infile;
>     PARQUET_ASSIGN_OR_THROW(infile,
>         arrow::io::ReadableFile::Open(datapath));
>     parquet::StreamReader stream{ parquet::ParquetFileReader::Open(infile)
>     };
>
>     int64_t c1;
>
>     while (!stream.eof()) {
>         stream >> c1;
>         stream.SkipColumns(100);
>         stream >> parquet::EndRow;
>
>         cout << c1 << endl;
>
> My code throws a ParquetException in the CheckColumn() function when
> comparing length and node->type_length() [stream_reader.cc, Line 543]:
>
>     if (length != node->type_length()) {
>       throw ParquetException("Column length mismatch. Column '" +
>                              node->name() +
>                              "' has length " +
>                              std::to_string(node->type_length()) +
>                              "] not " + std::to_string(length));
>     }
>
> I figured out that this was because there are empty data fields in my
> parquet, meaning length is 0 but node->type_length() is 64. I've looked all
> over the internet trying to find a way to properly handle empty values in
> parquet files using Arrow, but have had no luck. Is there a way to check if
> a data field is empty for a Parquet::StreamReader object, or some other way
> to manage empty fields?
>
> Any help would be appreciated.