See https://issues.apache.org/jira/browse/PARQUET-914
On Mon, Mar 13, 2017 at 6:01 PM, Wes McKinney <[email protected]> wrote:
> hi Grant,
>
> the exception is coming from
>
> if (num_rows_ != expected_rows_) {
>   throw ParquetException(
>       "Less than the number of expected rows written in"
>       " the current column chunk");
> }
>
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>
> This is double buggy -- the size of the row group and the number of
> values written is different, but you're writing *more* values than the
> row group contains. I'm opening a JIRA to throw a better exception
>
> See the logic for forming num_rows_ for columns with max_repetition_level > 0:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
>
> num_rows_ is incremented each time a new record begins
> (repetition_level 0). You can write as many repeated values as you
> like in a row group as long as the repetition levels encode the
> corresponding number of records -- if you run into a case where this
> happens, can you open a JIRA so we can add a test case and fix?
>
> Thanks
> Wes
>
> On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
>> I should also mention that I built parquet-cpp from github, commit
>> 1c4492a111b00ef48663982171e3face1ca2192d.
>>
>> On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
>>
>>> I'm struggling to get a simple parquet writer working using the c++
>>> library. The source is here:
>>>
>>> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
>>>
>>> and I'm compiling like so
>>>
>>> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
>>>
>>> When I run this program, I get the following error
>>>
>>> gmonroe@foo:~$ ./writer
>>> terminate called after throwing an instance of 'parquet::ParquetException'
>>> what(): Less than the number of expected rows written in the current
>>> column chunk
>>> Aborted (core dumped)
>>>
>>> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
>>> that every column needs to contain N values such that N
>>> % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
>>> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
>>>
>>> Is this a bug in the c++ library or am I missing something in the API?
>>>
>>> Regards,
>>> Grant Monroe
>>>
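
For anyone reading the archive later: the record-counting rule Wes describes in the quoted reply (num_rows_ goes up each time a value with repetition level 0 starts a new record) can be illustrated with a small standalone sketch. This is not parquet-cpp code. CountRecords and RowCountMatches are hypothetical helper names made up for illustration, and the sketch only assumes that the writer compares the number of level-0 entries against the expected row count of the row group.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of parquet-cpp: counts the records encoded by
// the repetition levels of one column chunk. A level of 0 marks the first
// value of a new record; any level > 0 continues the current record.
int64_t CountRecords(const std::vector<int16_t>& repetition_levels) {
  int64_t num_rows = 0;
  for (int16_t level : repetition_levels) {
    if (level == 0) {
      ++num_rows;
    }
  }
  return num_rows;
}

// Hypothetical check mirroring the num_rows_ != expected_rows_ condition in
// the quoted snippet: true when the levels encode exactly the expected rows.
bool RowCountMatches(const std::vector<int16_t>& repetition_levels,
                     int64_t expected_rows) {
  return CountRecords(repetition_levels) == expected_rows;
}
```

For example, the three records [a, b], [c], [d, e, f] give repetition levels 0 1 0 0 1 1. That is six values but three records, so it fits a row group declared with three rows. Three values each written with repetition level 0 count as three records, and would not fit a row group declared with two rows.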
