See https://issues.apache.org/jira/browse/PARQUET-914

On Mon, Mar 13, 2017 at 6:01 PM, Wes McKinney <[email protected]> wrote:
> hi Grant,
>
> the exception is coming from
>
>   if (num_rows_ != expected_rows_) {
>     throw ParquetException(
>         "Less than the number of expected rows written in"
>         " the current column chunk");
>   }
>
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
>
> This is doubly buggy -- the size of the row group and the number of
> values written are different, but you're writing *more* values than the
> row group contains. I'm opening a JIRA to throw a better exception.
>
> See the logic for forming num_rows_ for columns with max_repetition_level > 0:
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
>
> num_rows_ is incremented each time a new record begins
> (repetition_level 0). You can write as many repeated values as you
> like in a row group as long as the repetition levels encode the
> corresponding number of records -- if you run into a case where this
> happens, can you open a JIRA so we can add a test case and fix?
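
The counting rule described above can be sketched roughly as follows. This is only an illustration, not the parquet-cpp implementation: CountRecords and CheckColumnChunk are hypothetical helper names, and the error text is made up to show how a check could report whether too many or too few rows were written.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): counts the records encoded
// by a sequence of repetition levels. A level of 0 starts a new record;
// any level > 0 continues a repeated field inside the current record.
int64_t CountRecords(const std::vector<int16_t>& rep_levels) {
  int64_t num_rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) ++num_rows;
  }
  return num_rows;
}

// Hypothetical check (assumed names and message): compares the record count
// against the row group size and says in which direction they differ.
void CheckColumnChunk(int64_t expected_rows,
                      const std::vector<int16_t>& rep_levels) {
  const int64_t num_rows = CountRecords(rep_levels);
  if (num_rows == expected_rows) return;
  const std::string direction = num_rows > expected_rows ? "More" : "Fewer";
  throw std::runtime_error(direction + " rows written than expected: got " +
                           std::to_string(num_rows) + ", expected " +
                           std::to_string(expected_rows));
}
```

For instance, the levels {0, 1, 1, 0, 1} carry five values but only two records, so under this sketch they would fit a row group of 2 rows, while {0, 0} would be rejected for a row group of 1 row.
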
>
> Thanks
> Wes
>
> On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
>> I should also mention that I built parquet-cpp from github, commit
>> 1c4492a111b00ef48663982171e3face1ca2192d.
>>
>> On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]> wrote:
>>
>>> I'm struggling to get a simple parquet writer working using the c++
>>> library. The source is here:
>>>
>>> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
>>>
>>> and I'm compiling like so
>>>
>>> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
>>>
>>> When I run this program, I get the following error
>>>
>>> gmonroe@foo:~$ ./writer
>>> terminate called after throwing an instance of 'parquet::ParquetException'
>>>   what():  Less than the number of expected rows written in the current column chunk
>>> Aborted (core dumped)
>>>
>>> If I set NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
>>> that every column needs to contain N values such that
>>> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set of
>>> values, the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
>>>
>>> Is this a bug in the c++ library or am I missing something in the API?
>>>
>>> Regards,
>>> Grant Monroe
>>>
