Hi Grant,

Can you use the master branch or the 1.0.0-rc5 release and try again? You
will just get the error and not the core dump.

Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound on
the total number of rows in a RowGroup. The number of rows added must be
exactly equal to the NUM_ROWS_PER_ROW_GROUP value.

On Thu, Mar 16, 2017 at 12:41 AM, Grant Monroe <[email protected]> wrote:

> Yes, I realized after posting that my example was faulty because I'm not
> creating a new row group every 3 rows. But consider an even simpler
> example:
>
> https://gist.github.com/tnarg/caa2f098091760255e3c60da2cf17438
>
> I want to write a single JSON object:
>
> {
>   "foo": false,
>   "bars": [1,2,3]
> }
>
> I would create two columns in my schema, I choose a row group size of 10,
> and write 1 row to the "foo" column and 3 rows to the "bars" column. I get
> an error because I didn't write exactly 10 rows to each column. This seems
> broken.
>
> gmonroe@blah:~$ ./writer
> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Less than the number of expected rows written in the current
> column chunk
> Aborted (core dumped)
>
>
> On 2017-03-13 18:01 (-0400), Wes McKinney <[email protected]> wrote:
> > hi Grant,
> >
> > the exception is coming from
> >
> >   if (num_rows_ != expected_rows_) {
> >     throw ParquetException(
> >         "Less than the number of expected rows written in"
> >         " the current column chunk");
> >   }
> >
> >
> https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159
> >
> > This is doubly buggy -- the size of the row group and the number of
> > values written are different, but you're writing *more* values than the
> > row group contains. I'm opening a JIRA to throw a better exception.
> >
> > See the logic for forming num_rows_ for columns with max_repetition_level > 0:
> >
> >
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.cc#L323
> >
> > num_rows_ is incremented each time a new record begins
> > (repetition_level 0). You can write as many repeated values as you
> > like in a row group as long as the repetition levels encode the
> > corresponding number of records -- if you run into a case where this
> > happens, can you open a JIRA so we can add a test case and fix?
> >
> > Thanks
> > Wes
> >
> > On Mon, Mar 13, 2017 at 12:14 PM, Grant Monroe <[email protected]> wrote:
> > > I should also mention that I built parquet-cpp from github, commit
> > > 1c4492a111b00ef48663982171e3face1ca2192d.
> > >
> > > On Mon, Mar 13, 2017 at 12:10 PM, Grant Monroe <[email protected]>
> wrote:
> > >
> > >> I'm struggling to get a simple parquet writer working using the C++
> > >> library. The source is here:
> > >>
> > >> https://gist.github.com/tnarg/8878a38d4a22104328c4d289319f9ac1
> > >>
> > >> and I'm compiling like so
> > >>
> > >> g++ --std=c++11 -o writer writer.cc -lparquet -larrow -larrow_io
> > >>
> > >> When I run this program, I get the following error
> > >>
> > >> gmonroe@foo:~$ ./writer
> > >> terminate called after throwing an instance of 'parquet::ParquetException'
> > >>   what():  Less than the number of expected rows written in the current
> > >> column chunk
> > >> Aborted (core dumped)
> > >>
> > >> If I change NUM_ROWS_PER_ROW_GROUP=3, this writer succeeds. This suggests
> > >> that every column needs to contain N values such that
> > >> N % NUM_ROWS_PER_ROW_GROUP = 0 and N > 0. For an arbitrarily complex set
> > >> of values the only reasonable choice for NUM_ROWS_PER_ROW_GROUP is 1.
> > >>
> > >> Is this a bug in the C++ library or am I missing something in the API?
> > >>
> > >> Regards,
> > >> Grant Monroe
> > >>
> >
>



-- 
regards,
Deepak Majeti
