Hi Luca,

It seems to me that the problem comes from node->type_length(). For an
int64 column it should be 0 instead of 64. Could you please check the value
of column_index_ in CheckColumn() just before it throws? If you need further
assistance, please create an issue on GitHub, ideally with a file that
reproduces the problem.
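
To make the reasoning concrete, here is a minimal standalone sketch of the
length check quoted below. It does not use Arrow at all: ColumnInfo,
CheckLength and LengthCheckPasses are hypothetical names invented for
illustration. The assumption it encodes is that `length` is the value the
extraction operator expects for the requested C++ type (0 for an int64_t, as
far as I can tell), not the size of the data stored in a given row.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the schema node of a single column.
struct ColumnInfo {
  std::string name;
  int type_length;  // length recorded for the column in the file schema
};

// Simplified model of the comparison: throws when the expected length
// differs from the length recorded in the schema.
void CheckLength(const ColumnInfo& node, int length) {
  if (length != node.type_length) {
    throw std::runtime_error("Column length mismatch.  Column '" + node.name +
                             "' has length " + std::to_string(node.type_length) +
                             "] not " + std::to_string(length));
  }
}

// Convenience wrapper: true if the check passes, false if it throws.
bool LengthCheckPasses(const ColumnInfo& node, int length) {
  try {
    CheckLength(node, length);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

With an expected length of 0, a column whose schema length is 0 passes and
one whose schema length is 64 fails, whatever the rows contain. That is why
the column the reader is positioned on matters more here than whether the
fields are empty. If the column really does contain nulls, the StreamReader
extraction overloads that take parquet::StreamReader::optional<T> may be
worth a look, though I have not checked that against your file.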

Thanks
Gang


On Tue, Jul 4, 2023 at 5:31 AM Luca Jones <kashara...@gmail.com> wrote:

> Hi,
>
> I've been trying to read data from a Parquet file into a stream using the
> parquet::StreamReader class for a while. The first column of my data
> consists of int64s - thus, I have been streaming data as follows:
>
>     shared_ptr<arrow::io::ReadableFile> infile;
>     PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(datapath));
>     parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};
>
>     int64_t c1;
>
>     while (!stream.eof()) {
>         stream >> c1;
>         stream.SkipColumns(100);
>         stream >> parquet::EndRow;
>
>         cout << c1 << endl;
>     }
>
> My code throws a ParquetException in the CheckColumn() function when
> comparing length and node->type_length() [stream_reader.cc, Line 543]:
>
>   if (length != node->type_length()) {
>     throw ParquetException("Column length mismatch.  Column '" + node->name() +
>                            "' has length " + std::to_string(node->type_length()) +
>                            "] not " + std::to_string(length));
>   }
>
> I figured out that this was because there are empty data fields in my
> Parquet file, meaning length is 0 but node->type_length() is 64. I've
> looked all over the internet trying to find a way to properly handle empty
> values in Parquet files using Arrow, but have had no luck. Is there a way
> to check if a data field is empty for a parquet::StreamReader object, or
> some other way to manage empty fields?
>
> Any help would be appreciated.
>
