The diagram reflects the Thrift definition:
https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
It is slightly out of date (some fields are missing), but it is accurate
regarding the entities.
To find the corresponding classes defined in parquet-mr, refer to this
class:
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java
Row groups (aka Parquet blocks) have the row count.
Each column chunk has the value count.
The value count includes nulls. It is equal to the row count when there are
no repeated fields; otherwise it can be greater.
A repetition level of 0 indicates a new row. See:
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
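
To make the row count vs. value count distinction concrete, here is a
minimal sketch in plain Python. It does not use the parquet-mr API; the
function names and the sample repetition levels are made up for
illustration, and it assumes one repetition level is recorded per value
(nulls and empty lists included).

```python
from typing import List


def count_rows(repetition_levels: List[int]) -> int:
    """Rows in a column chunk: each repetition level of 0 starts a new row."""
    return sum(1 for level in repetition_levels if level == 0)


def count_values(repetition_levels: List[int]) -> int:
    """Values in a column chunk: one level per value, nulls included."""
    return len(repetition_levels)


# Hypothetical repeated column holding three rows:
#   row 1: [a, b, c]
#   row 2: []   (recorded as a single null entry)
#   row 3: [d]
levels = [0, 1, 1, 0, 0]

rows = count_rows(levels)      # 3 rows
values = count_values(levels)  # 5 values, more than the row count
print(rows, values)

# Without repeated fields every level is 0, so the two counts match.
flat_levels = [0, 0, 0]
assert count_rows(flat_levels) == count_values(flat_levels)
```

So to allocate exact space for a whole column, it is the value counts of
its column chunks that should be summed across row groups, not the row
counts.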




On Wed, Aug 27, 2014 at 3:13 PM, pratik khadloya <[email protected]>
wrote:

> I think the metadata relationships diagram on the parquet-mr github page (
> https://github.com/Parquet/parquet-format ) is out of date with what is in
> the code.
>
> I do not see any BlockMetaData in the diagram, and it seems like
> ColumnMetaData has been renamed to ColumnChunkMetaData; some of the
> variable names have also been changed.
>
> Can anyone with knowledge of the metadata please update the diagram?
> It would help a lot in understanding the code.
>
> Currently I am trying to figure out how to get the total count of the
> number of rows for a given column, and whether nulls would be included in
> the count or not. I am doing this so that I can read entire columns into
> memory (I have to allocate the exact amount of space) and then index them
> by the primary key, so that I can do fast in-memory lookups.
>
> Any help would be appreciated.
>
> Thanks,
> ~Pratik
>
