Thanks for the detailed information, Julien.

Regards,
Pratik

On Fri, Aug 29, 2014 at 4:43 PM, Julien Le Dem <[email protected]> wrote:

> The diagram reflects the thrift:
> https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
> It is slightly out of date (missing fields) but is accurate regarding
> entities.
>
> For finding the corresponding classes defined in parquet-mr, refer to this
> class:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java
>
> RowGroups, aka Parquet blocks, have the row count.
> Each column chunk has the value count.
> The value count is equal to the row count (it includes nulls) when there
> are no repeated fields. Otherwise it can be greater.
> A repetition level of 0 indicates a new row.
> https://blog.twitter.com/2013/dremel-made-simple-with-parquet
>
> On Wed, Aug 27, 2014 at 3:13 PM, pratik khadloya <[email protected]>
> wrote:
>
> > I think the metadata relationships diagram on the parquet-mr github page
> > ( https://github.com/Parquet/parquet-format ) is out of date with what
> > is in the code.
> >
> > I do not see any BlockMetaData in the diagram, and it seems like
> > ColumnMetaData has been renamed to ColumnChunkMetaData. Some of the
> > variable names have also been changed.
> >
> > Can anyone with knowledge of the metadata please update the diagram?
> > It will help a lot in understanding the code.
> >
> > Currently I am trying to figure out how to get the total count of the
> > number of rows for a given column, and whether nulls would be included
> > in the count or not. I am doing this so that I can read entire columns
> > into memory (I have to allocate the exact amount of space) and then
> > index them by the primary key, so that I can do fast in-memory lookups.
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > ~Pratik
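
To illustrate Julien's point that a repetition level of 0 indicates a new row, here is a minimal Python sketch. It does not use the parquet-mr API; the function name `count_rows` and the sample levels are hypothetical, and it only assumes the repetition levels of one column chunk are already available as a list of integers.

```python
def count_rows(repetition_levels):
    """Count rows in one column chunk.

    Assumption: every entry in the chunk (including nulls) carries a
    repetition level, and a level of 0 marks the start of a new row.
    """
    return sum(1 for level in repetition_levels if level == 0)


# Hypothetical repeated column holding three rows:
#   row 1: two values   -> levels 0, 1
#   row 2: null / empty -> level  0
#   row 3: three values -> levels 0, 1, 1
levels = [0, 1, 0, 0, 1, 1]

value_count = len(levels)       # entries stored in the column chunk
row_count = count_rows(levels)  # rows those entries belong to
```

Here `value_count` is 6 while `row_count` is 3, so sizing an in-memory buffer from the value count would over-allocate for a repeated column. For a column with no repeated fields every level is 0 and the two counts match.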
