Yes, I realized after posting that my example was faulty because I'm not creating a new row group every 3 rows. But consider an even simpler example:
https://gist.github.com/tnarg/caa2f098091760255e3c60da2cf17438

I want to write a single JSON object:

{ "foo": false, "bars": [1,2,3] }

I would create two columns in my schema, choose a row group size of 10, and write 1 row to the "foo" column and 3 rows to the "bars" column. I get an error because I didn't write exactly 10 rows to each column. This seems broken.

gmonroe@blah:~$ ./writer
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Less than the number of expected rows written in the current column chunk
Aborted (core dumped)

On 2017-03-13 18:01 (-0400), Wes McKinney <[email protected]> wrote:
> hi Grant,
>
> the exception is coming from
>
>   if (num_rows_ != expected_rows_) {
>     throw ParquetException(
>         "Less than the number of expected rows written in"
>         " the current column chunk");
>   }
>
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>
> This is double buggy -- the size of the row group and the number of
> values written is different, but you're writing *more* values than the
> row group contains. I'm opening a JIRA to throw a better exception
>
> See the logic for forming num_rows_ for columns with max_repetition_level > 0:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
>
> num_rows_ is incremented each time a new record begins
> (repetition_level 0). You can write as many repeated values as you
> like in a row group as long as the repetition levels encode the
> corresponding number of records -- if you run into a case where this
> happens, can you open a JIRA so we can add a test case and fix?
>
> Thanks
> Wes
>
> On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> > I should also mention that I built parquet-cpp from github, commit
> > 1c4492a111b00ef48663982171e3face1ca2192d.
> >
> > On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
> >
> >> I'm struggling to get a simple parquet writer working using the C++
> >> library. The source is here:
> >>
> >> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> >>
> >> and I'm compiling like so
> >>
> >> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> >>
> >> When I run this program, I get the following error
> >>
> >> gmonroe@foo:~$ ./writer
> >> terminate called after throwing an instance of 'parquet::ParquetException'
> >> what(): Less than the number of expected rows written in the current
> >> column chunk
> >> Aborted (core dumped)
> >>
> >> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
> >> that every column needs to contain N values such that N
> >> % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
> >> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
> >>
> >> Is this a bug in the C++ library or am I missing something in the API?
> >>
> >> Regards,
> >> Grant Monroe
> >>
>
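
For reference, here is a minimal sketch of the counting rule Wes describes in the quoted reply: a row is counted each time a value arrives with repetition level 0. It does not use parquet-cpp at all; CountRecords is a hypothetical helper written only to illustrate the rule, and the level arrays are what I would expect for the "bars" column, not something taken from the gist.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): counts the records encoded
// by a repeated column's repetition levels. A level of 0 marks the start of
// a new record; a level > 0 continues a repeated field in the same record.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  // The first value written to a column chunk has to start a record.
  assert(rep_levels.empty() || rep_levels.front() == 0);

  int64_t num_rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) {
      ++num_rows;  // new record begins here
    }
  }
  return num_rows;
}

// Assumed repetition levels for "bars": [1, 2, 3] inside a single object:
// the first element starts the record, the other two continue the list.
std::vector<int16_t> BarsRepLevels() { return {0, 1, 1}; }
```

With these levels, CountRecords(BarsRepLevels()) gives 1, so the three "bars" values count as one row, the same as the single "foo" value. On that reading, the single-object example holds one row per column, and a row group declared with 10 rows would still fail the num_rows_ != expected_rows_ check.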
