Hi Grant,
The exception is coming from this check:

    if (num_rows_ != expected_rows_) {
      throw ParquetException(
          "Less than the number of expected rows written in"
          " the current column chunk");
    }

https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
This is doubly buggy: the size of the row group and the number of
values written are different, but you're writing *more* values than the
row group contains, not fewer as the message says. I'm opening a JIRA
to throw a better exception.
See the logic for forming num_rows_ for columns with max_repetition_level > 0:
https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
num_rows_ is incremented each time a new record begins
(repetition_level 0). You can write as many repeated values as you
like in a row group as long as the repetition levels encode the
corresponding number of records. If you run into a case where the
exception is thrown anyway, can you open a JIRA so we can add a test
case and fix it?
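
To make the counting rule concrete, here is a minimal sketch of how the
record count falls out of the repetition levels. CountRecords is a
hypothetical helper written for this email, not a function in
parquet-cpp, and it only illustrates the rule described above (a
repetition level of 0 starts a new record); the real writer keeps this
count internally in num_rows_.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): counts how many records
// a sequence of repetition levels encodes. A repetition level of 0 marks
// the first value of a new record; any level > 0 continues the current one.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  int64_t num_rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) {
      ++num_rows;
    }
  }
  return num_rows;
}
```

For example, the levels {0, 1, 1, 0, 1, 0} describe six values but only
three records, so a row group declared with three rows should accept
them.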
Thanks
Wes
On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> I should also mention that I built parquet-cpp from github, commit
> 1c4492a111b00ef48663982171e3face1ca2192d.
>
> On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
>
>> I'm struggling to get a simple parquet writer working using the c++
>> library. The source is here:
>>
>> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
>>
>> and I'm compiling like so
>>
>> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
>>
>> When I run this program, I get the following error
>>
>> gmonroe@foo:~$ ./writer
>> terminate called after throwing an instance of 'parquet::ParquetException'
>> what(): Less than the number of expected rows written in the current
>> column chunk
>> Aborted (core dumped)
>>
>> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
>> that every column needs to contain N values such that N
>> % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
>> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
>>
>> Is this a bug in the c++ library or am I missing something in the API?
>>
>> Regards,
>> Grant Monroe
>>