Thanks for the detailed information, Julien.

Regards,
Pratik


On Fri, Aug 29, 2014 at 4:43 PM, Julien Le Dem <[email protected]>
wrote:

> The diagram reflects the Thrift definition:
>
> https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
> It is slightly out of date (missing fields) but is accurate regarding
> entities.
> For finding the corresponding classes defined in parquet-mr, refer to this
> class:
>
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java
> RowGroups (aka Parquet blocks) have the row count.
> Each column chunk has the value count.
> The value count is equal to the row count (it includes nulls) when there are
> no repeated fields. Otherwise it can be greater.
> A repetition level of 0 indicates a new row.
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet
>
> On Wed, Aug 27, 2014 at 3:13 PM, pratik khadloya <[email protected]>
> wrote:
>
> > I think the metadata relationships diagram on the parquet-mr github page
> > ( https://github.com/Parquet/parquet-format ) is out of date with what is
> > in the code.
> >
> > I do not see any BlockMetaData in the diagram, and it seems like
> > ColumnMetaData has been renamed to ColumnChunkMetaData. Some of the
> > variable names have also been changed.
> >
> > Can anyone with knowledge of the metadata please update the diagram?
> > It will help a lot in understanding the code.
> >
> > Currently I am trying to figure out how to get the total count of the
> > number of rows for a given column and whether nulls would be included in
> > the count. I am doing this so that I can read entire columns into
> > memory (I have to allocate the exact amount of space) and then index them
> > by the primary key, so that I can do fast in-memory lookups.
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > ~Pratik
> >
>
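
The relationship Julien describes between row count, value count, and repetition levels can be illustrated with a small sketch. This is not parquet-mr code: the `ColumnChunkLevels` class and the helper functions below are hypothetical names made up for illustration, and the level arrays are hand-written examples rather than data read from a real file. It only assumes what the thread states (every entry, null or not, counts as a value, and a repetition level of 0 starts a new row), plus the Dremel convention that an entry whose definition level is below the column's maximum is not a fully defined value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunkLevels:
    """Hypothetical container for the levels of one column chunk.

    This is an illustration only, not a class from parquet-mr.
    """
    repetition_levels: List[int]
    definition_levels: List[int]
    max_definition_level: int


def value_count(chunk: ColumnChunkLevels) -> int:
    # Every entry counts as a value, including nulls.
    return len(chunk.repetition_levels)


def row_count(chunk: ColumnChunkLevels) -> int:
    # A repetition level of 0 marks the start of a new row.
    return sum(1 for r in chunk.repetition_levels if r == 0)


def undefined_count(chunk: ColumnChunkLevels) -> int:
    # Entries below the maximum definition level carry no actual value
    # (a null, or an empty list for a repeated field).
    return sum(
        1 for d in chunk.definition_levels if d < chunk.max_definition_level
    )


# Optional, non-repeated column holding the rows: 1, null, 3
flat = ColumnChunkLevels(
    repetition_levels=[0, 0, 0],
    definition_levels=[1, 0, 1],
    max_definition_level=1,
)

# Repeated column holding the rows: [a, b], [], [c]
repeated = ColumnChunkLevels(
    repetition_levels=[0, 1, 0, 0],
    definition_levels=[1, 1, 0, 1],
    max_definition_level=1,
)

for name, chunk in (("flat", flat), ("repeated", repeated)):
    print(name, value_count(chunk), row_count(chunk), undefined_count(chunk))
```

For the flat column the value count matches the row count, so sizing a buffer by the value count reserves one slot per row, nulls included. For the repeated column the value count exceeds the row count, so a buffer sized by row count alone would be too small.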
