Re: Failing C Parquet Writer

Deepak Majeti Thu, 16 Mar 2017 13:25:16 -0700

As an example, you can look at
https://github.com/apache/parquet-cpp/blob/master/examples/reader-writer.cc#L140
The int64_field column has a list of size 2 in every row.
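
To make the level bookkeeping concrete, here is a minimal sketch of how the
levels for the "bars" column quoted below could be derived. It assumes "bars"
is a REPEATED int64 leaf directly under the schema root (so the maximum
definition and repetition levels are both 1). ColumnData, FlattenRepeated and
CountRows are hypothetical helpers written for illustration; they are not part
of parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Levels and values for one repeated leaf column.
// Assumption: max definition level = 1, max repetition level = 1.
struct ColumnData {
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  std::vector<int64_t> values;
};

// Hypothetical helper: flatten one list per top-level record into levels.
ColumnData FlattenRepeated(const std::vector<std::vector<int64_t>>& rows) {
  ColumnData out;
  for (const auto& row : rows) {
    if (row.empty()) {
      // An empty list still takes one level slot but stores no value.
      out.def_levels.push_back(0);
      out.rep_levels.push_back(0);
      continue;
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
      out.def_levels.push_back(1);
      // Rep level 0 starts a new record; 1 continues the current list.
      out.rep_levels.push_back(i == 0 ? 0 : 1);
      out.values.push_back(row[i]);
    }
  }
  // Every level slot has both a definition and a repetition level.
  assert(out.def_levels.size() == out.rep_levels.size());
  return out;
}

// Hypothetical helper: top-level records = entries with repetition level 0.
int64_t CountRows(const ColumnData& col) {
  int64_t rows = 0;
  for (int16_t rep : col.rep_levels) {
    if (rep == 0) ++rows;
  }
  return rows;
}
```

For the single record with bars = [1, 2, 3] this gives repetition levels
0, 1, 1, definition levels 1, 1, 1, and CountRows returns 1, which matches the
table in Wes's reply. Arrays of this shape are what I would expect to pass to
the column writer's WriteBatch call, as the linked example does, but please
check the signature against the version you build with.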


On Thu, Mar 16, 2017 at 3:56 PM, Wes McKinney <[email protected]> wrote:

> The definition levels depend on the array encoding -- so to account
> for nullable lists and nullable values, the actual definition levels
> (based on the schema) may range from 1 to 3.
>
> I found this exposition in the Impala codebase really useful:
>
> https://github.com/apache/incubator-impala/blob/master/be/src/exec/hdfs-parquet-scanner.h#L78
>
>
> On Thu, Mar 16, 2017 at 3:51 PM, Wes McKinney <[email protected]> wrote:
> > hi Grant,
> >
> > The value [1, 2, 3] is only 1 value, not 3. The "Number of rows"
> > passed to the row group is with respect to top level records, *not*
> > counting repeated fields.
> >
> > From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I
> > believe the correct data to write is:
> >
> > rep level | def level  | value
> > 0         | 1          | 1
> > 1         | 1          | 2
> > 1         | 1          | 3
> >
> > parquet-cpp knows from this data that the 3 values are part of only
> > one logical record.
> >
> > Does that make sense?
> >
> > Thanks
> > Wes
> >
> > On Thu, Mar 16, 2017 at 3:40 PM, Grant Monroe <[email protected]> wrote:
> >> Hi Deepak,
> >>
> >>> Can you use the master branch or the 1.0.0-rc5 release and try again? You
> >>> will just get the error and not the core dump.
> >>
> >> Upgrading to master does indeed remove the abort().
> >>
> >>> Just to clarify, the NUM_ROWS_PER_ROW_GROUP value is NOT an upper bound to
> >>> the total number of rows in a RowGroup. The number of rows being added must
> >>> be exactly equal to the NUM_ROWS_PER_ROW_GROUP value.
> >>
> >> I can see that from the error message. My question is, given the
> >> example JSON object
> >>
> >> {
> >> "foo": false,
> >> "bars": [1,2,3]
> >> }
> >>
> >> how might I store this using the parquet-cpp API? I have one column
> >> with 1 value and another with 3. The only general solution I can see would
> >> be to use NUM_ROWS_PER_ROW_GROUP=1 which seems like nonsense. What am I
> >> missing? Sample code would be helpful.
> >>
> >> Thanks,
> >> Grant
> >>
>



-- 
regards,
Deepak Majeti
