I ran the same program on master and I see the following error:

    Parquet write error: More rows were written in the column chunk than expected

This bug should throw at
https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L337
However, I see the same problem as Grant posted for commit
1c4492a111b00ef48663982171e3face1ca2192d. The core dump is caused by two
parquet exceptions being in flight at the same time; this is fixed in
commit 076011b08498317d213cdbc0a64128a5dd8da4c0. The first exception is
thrown at
https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L337
The Parquet writer destructor then tries to close the file and throws a
second exception at
https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
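To make the failure mode concrete, here is a minimal standalone sketch of
why two in-flight exceptions produce the abort. This is plain C++, not
parquet-cpp source; the class name is made up:

    #include <iostream>
    #include <stdexcept>

    // Stand-in for the pre-fix column writer: its destructor runs cleanup
    // that can itself throw (the Close() path at writer.cc#L159).
    struct WriterLike {
      ~WriterLike() noexcept(false) {
        throw std::runtime_error(
            "Less than the number of expected rows written in"
            " the current column chunk");
      }
    };

    int main() {
      try {
        WriterLike w;
        // First exception: the overflow check at writer.cc#L337 fires.
        throw std::runtime_error(
            "More rows were written in the column chunk than expected");
        // While this exception unwinds the stack, ~WriterLike() throws a
        // second one; two active exceptions force std::terminate(), which
        // is the "Aborted (core dumped)" Grant saw.
      } catch (const std::exception& e) {
        std::cerr << e.what() << "\n";  // never reached
      }
      return 0;
    }

Guarding the destructor's close path (for example with a try/catch, or by
checking std::uncaught_exception() before throwing) is the usual way to
avoid this pattern.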
On Mon, Mar 13, 2017 at 6:03 PM, Wes McKinney <[email protected]> wrote:
> See https://issues.apache.org/jira/browse/PARQUET-914
>
> On Mon, Mar 13, 2017 at 6:01 PM, Wes McKinney <[email protected]> wrote:
> > hi Grant,
> >
> > the exception is coming from
> >
> >     if (num_rows_ != expected_rows_) {
> >       throw ParquetException(
> >           "Less than the number of expected rows written in"
> >           " the current column chunk");
> >     }
> >
> > https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
> >
> > This is doubly buggy -- the size of the row group and the number of
> > values written do differ, but you're writing *more* values than the
> > row group contains, not fewer, so the message is wrong. I'm opening a
> > JIRA to throw a better exception.
> >
> > See the logic for forming num_rows_ for columns with
> > max_repetition_level > 0:
> >
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
> >
> > num_rows_ is incremented each time a new record begins
> > (repetition_level 0). You can write as many repeated values as you
> > like in a row group as long as the repetition levels encode the
> > corresponding number of records -- if you run into a case where this
> > happens, can you open a JIRA so we can add a test case and fix it?
> >
> > Thanks
> > Wes
> >
> > On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> >> I should also mention that I built parquet-cpp from GitHub, commit
> >> 1c4492a111b00ef48663982171e3face1ca2192d.
> >>
> >> On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
> >>
> >>> I'm struggling to get a simple parquet writer working using the C++
> >>> library. The source is here:
> >>>
> >>> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> >>>
> >>> and I'm compiling like so:
> >>>
> >>>     g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> >>>
> >>> When I run this program, I get the following error:
> >>>
> >>>     gmonroe@foo:~$ ./writer
> >>>     terminate called after throwing an instance of 'parquet::ParquetException'
> >>>       what(): Less than the number of expected rows written in the current column chunk
> >>>     Aborted (core dumped)
> >>>
> >>> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This
> >>> suggests that every column needs to contain N values such that
> >>> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex
> >>> set of values, the only reasonable choice for NUM_ROWS_PER_ROW_GROUP
> >>> is 1.
> >>>
> >>> Is this a bug in the C++ library or am I missing something in the API?
> >>>
> >>> Regards,
> >>> Grant Monroe
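For reference, here is a sketch of the pattern Wes describes above: a row
group declared with 2 rows that legally holds 5 physical values, because
the repetition levels contain exactly two zeros. This is written against
the parquet-cpp API of this thread's vintage and I have not compiled it, so
treat the exact names as approximate:

    // repeated_writer.cc -- sketch only, API names approximate.
    #include <memory>
    #include <arrow/io/file.h>
    #include <parquet/api/writer.h>

    int main() {
      using parquet::Repetition;
      using parquet::Type;
      using parquet::schema::GroupNode;
      using parquet::schema::PrimitiveNode;

      // Schema with a single repeated int64 leaf (max_repetition_level == 1).
      parquet::schema::NodeVector fields;
      fields.push_back(
          PrimitiveNode::Make("values", Repetition::REPEATED, Type::INT64));
      auto schema = std::static_pointer_cast<GroupNode>(
          GroupNode::Make("schema", Repetition::REQUIRED, fields));

      std::shared_ptr<::arrow::io::FileOutputStream> out_file;
      PARQUET_THROW_NOT_OK(
          ::arrow::io::FileOutputStream::Open("repeated.parquet", &out_file));
      auto file_writer = parquet::ParquetFileWriter::Open(out_file, schema);

      // The row group is sized for 2 records, yet we write 5 physical values.
      parquet::RowGroupWriter* rg_writer = file_writer->AppendRowGroup(2);
      auto* col = static_cast<parquet::Int64Writer*>(rg_writer->NextColumn());

      int64_t values[] = {1, 2, 3, 4, 5};
      int16_t def_levels[] = {1, 1, 1, 1, 1};  // every value is present
      // repetition_level 0 starts a new record; exactly two zeros -> two
      // rows, so num_rows_ matches expected_rows_ and Close() succeeds.
      int16_t rep_levels[] = {0, 1, 1, 0, 1};
      col->WriteBatch(5, def_levels, rep_levels, values);

      file_writer->Close();
      return 0;
    }

Because num_rows_ is only incremented at repetition_level 0, the close
check sees num_rows_ == expected_rows_ == 2 even though five values were
written, which is exactly the behavior Wes describes.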
--
regards,
Deepak Majeti