Hi Grant,

The exception is coming from:

  if (num_rows_ != expected_rows_) {
    throw ParquetException(
        "Less than the number of expected rows written in"
        " the current column chunk");
  }

https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159

This is doubly buggy -- the size of the row group and the number of
values written are different, but you're writing *more* values than the
row group contains, not fewer as the message says. I'm opening a JIRA
to throw a better exception.

See the logic for forming num_rows_ for columns with max_repetition_level > 0:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323

num_rows_ is incremented each time a new record begins
(repetition_level 0). You can write as many repeated values as you
like in a row group as long as the repetition levels encode the
corresponding number of records. If you run into a case where this
does not work, can you open a JIRA so we can add a test case and fix it?
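
To illustrate the counting rule, here is a minimal sketch. It is not the
actual parquet-cpp code, and CountRecords is a hypothetical helper name
used only for this example. It assumes the repetition levels for one
column chunk are available as a vector:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of parquet-cpp: counts the records encoded
// by a sequence of repetition levels. A level of 0 marks the start of a new
// record; any level > 0 continues the current record.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  // Sanity check: the first value in a column chunk must begin a record.
  assert(rep_levels.empty() || rep_levels.front() == 0);
  int64_t num_rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) ++num_rows;
  }
  return num_rows;
}
```

For example, the levels {0, 1, 1, 0, 1, 0} carry six values but only
three records, so they would fit a row group declared with three rows.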

Thanks
Wes

On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> I should also mention that I built parquet-cpp from github, commit
> 1c4492a111b00ef48663982171e3face1ca2192d.
>
> On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
>
>> I'm struggling to get a simple parquet writer working using the c++
>> library. The source is here:
>>
>> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
>>
>> and I'm compiling like so
>>
>> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
>>
>> When I run this program, I get the following error
>>
>> gmonroe@foo:~$ ./writer
>> terminate called after throwing an instance of 'parquet::ParquetException'
>>   what():  Less than the number of expected rows written in the current
>> column chunk
>> Aborted (core dumped)
>>
>> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
>> that every column needs to contain N values such that N
>> % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
>> values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
>>
>> Is this a bug in the c++ library or am I missing something in the API?
>>
>> Regards,
>> Grant Monroe
>>
