Yes, I realized after posting that my example was faulty because I'm not
creating a new row group every 3 rows. But consider an even simpler example:

https://gist.github.com/tnarg/caa2f098091760255e3c60da2cf17438

I want to write a single JSON object:

{
  "foo": false,
  "bars": [1,2,3]
}

I create two columns in my schema, choose a row group size of 10, and
write 1 value to the "foo" column and 3 values to the "bars" column. I get
an error because I didn't write exactly 10 rows to each column. This seems
broken.

gmonroe@blah:~$ ./writer
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Less than the number of expected rows written in the current column chunk
Aborted (core dumped)


On 2017-03-13 18:01 (-0400), Wes McKinney <[email protected]> wrote:
> hi Grant,
>
> the exception is coming from
>
>   if (num_rows_ != expected_rows_) {
>     throw ParquetException(
>         "Less than the number of expected rows written in"
>         " the current column chunk");
>   }
>
>
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>
> This is doubly buggy -- the size of the row group and the number of
> values written are different, but you're writing *more* values than the
> row group contains. I'm opening a JIRA to throw a better exception.
>
> See the logic for forming num_rows_ for columns with
> max_repetition_level > 0:
>
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
>
> num_rows_ is incremented each time a new record begins
> (repetition_level 0). You can write as many repeated values as you
> like in a row group as long as the repetition levels encode the
> corresponding number of records -- if you run into a case where this
> happens, can you open a JIRA so we can add a test case and fix?
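
To make the record-counting rule described above concrete, here is a
minimal sketch. It is not the parquet-cpp implementation: CountRecords and
RunExample are hypothetical names used only for illustration, and the
repetition levels assume "bars" is a repeated field with
max_repetition_level 1. It counts one row each time a repetition level of
0 appears, applied to the single JSON object from this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a parquet-cpp function): a new record starts at
// every value whose repetition level is 0, so the number of rows in a column
// chunk is the number of zeros in its repetition levels.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  int64_t num_rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) {
      ++num_rows;
    }
  }
  return num_rows;
}

// Illustration for the object {"foo": false, "bars": [1,2,3]}.
// Assumption: "bars" is a repeated field with max_repetition_level 1.
void RunExample() {
  // "foo" holds one value, which starts the only record.
  std::vector<int16_t> foo_rep = {0};
  // "bars" holds three values: the first starts the record (level 0),
  // the next two continue the same list (level 1).
  std::vector<int16_t> bars_rep = {0, 1, 1};

  // Both columns describe exactly one row, despite holding 1 and 3 values.
  assert(CountRecords(foo_rep) == 1);
  assert(CountRecords(bars_rep) == 1);
}
```

Under these assumptions, a row group declared with 1 row would be
consistent with both columns, while a row group declared with 10 rows
would not, because only one record is encoded in each column.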
>
> Thanks
> Wes
>
> On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> > I should also mention that I built parquet-cpp from github, commit
> > 1c4492a111b00ef48663982171e3face1ca2192d.
> >
> > On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
> >
> >> I'm struggling to get a simple parquet writer working using the C++
> >> library. The source is here:
> >>
> >> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> >>
> >> and I'm compiling like so
> >>
> >> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> >>
> >> When I run this program, I get the following error
> >>
> >> gmonroe@foo:~$ ./writer
> >> terminate called after throwing an instance of 'parquet::ParquetException'
> >>   what():  Less than the number of expected rows written in the current column chunk
> >> Aborted (core dumped)
> >>
> >> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This
> >> suggests that every column needs to contain N values such that
> >> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex
> >> set of values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP
> >> is 1.
> >>
> >> Is this a bug in the C++ library or am I missing something in the API?
> >>
> >> Regards,
> >> Grant Monroe
> >>
>
