Re: [parquet-dev] Efficiently obtaining the number of rows of a Parquet data store.

JULIEN LE DEM Sat, 26 Jul 2014 18:51:21 -0700

Hi Brandon,
You could probably make a copy of the Thrift definition and keep only the
fields you need.
If you use the classes generated from that trimmed definition to read the
metadata, Thrift will skip all the other fields.
Julien
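
A rough sketch of the same idea without generated classes, in Python for
brevity rather than the Scala mentioned in the question. It decodes only
`num_rows` from the footer and skips every other field by its wire type,
which is what a trimmed Thrift definition would do. Assumptions, not
confirmed anywhere in this thread: the file is on a local filesystem rather
than HDFS; it ends with the Thrift-compact-encoded `FileMetaData`, a 4-byte
little-endian footer length, and the `PAR1` magic (as the parquet-format
README describes for data files); `num_rows` is field 3 of type i64; and the
footer is not encrypted. The function names are hypothetical.

```python
import os
import struct

def read_varint(buf, pos):
    """Decode an unsigned varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def unzigzag(n):
    """Undo zigzag encoding of a signed integer."""
    return (n >> 1) ^ -(n & 1)

def read_field_header(buf, pos, last_id):
    """Return (field_id, type, next_pos); field_id is None at the stop byte."""
    header = buf[pos]
    pos += 1
    if header == 0:
        return None, 0, pos
    delta, ttype = header >> 4, header & 0x0F
    if delta:
        return last_id + delta, ttype, pos
    raw, pos = read_varint(buf, pos)      # long form: explicit field id
    return unzigzag(raw), ttype, pos

def skip_struct(buf, pos):
    """Skip a whole struct, returning the position after its stop byte."""
    last_id = 0
    while True:
        field_id, ttype, pos = read_field_header(buf, pos, last_id)
        if field_id is None:
            return pos
        pos = skip_value(buf, pos, ttype)
        last_id = field_id

def skip_value(buf, pos, ttype, in_container=False):
    """Skip one compact-protocol value of wire type `ttype`."""
    if ttype in (1, 2):                   # bool: in the header, or 1 byte in a container
        return pos + 1 if in_container else pos
    if ttype == 3:                        # i8
        return pos + 1
    if ttype in (4, 5, 6):                # i16 / i32 / i64
        return read_varint(buf, pos)[1]
    if ttype == 7:                        # double
        return pos + 8
    if ttype == 8:                        # binary / string
        length, pos = read_varint(buf, pos)
        return pos + length
    if ttype in (9, 10):                  # list / set
        size, elem = buf[pos] >> 4, buf[pos] & 0x0F
        pos += 1
        if size == 15:                    # large sizes follow as a varint
            size, pos = read_varint(buf, pos)
        for _ in range(size):
            pos = skip_value(buf, pos, elem, True)
        return pos
    if ttype == 11:                       # map
        size, pos = read_varint(buf, pos)
        if size:
            kv = buf[pos]
            pos += 1
            for _ in range(size):
                pos = skip_value(buf, pos, kv >> 4, True)
                pos = skip_value(buf, pos, kv & 0x0F, True)
        return pos
    if ttype == 12:                       # nested struct
        return skip_struct(buf, pos)
    raise ValueError("unknown compact type %d" % ttype)

def num_rows_from_footer(footer):
    """Return field 3 (assumed to be num_rows, i64) of a FileMetaData blob."""
    pos = last_id = 0
    while True:
        field_id, ttype, pos = read_field_header(footer, pos, last_id)
        if field_id is None:
            raise ValueError("num_rows field not found")
        if field_id == 3 and ttype == 6:
            return unzigzag(read_varint(footer, pos)[0])
        pos = skip_value(footer, pos, ttype)
        last_id = field_id

def num_rows(path):
    """Read only the footer of a local file and extract num_rows."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
        if tail[4:] != b"PAR1":
            raise ValueError("missing PAR1 magic at end of file")
        (footer_len,) = struct.unpack("<I", tail[:4])
        f.seek(-8 - footer_len, os.SEEK_END)
        return num_rows_from_footer(f.read(footer_len))
```

Calling `num_rows("part-00000.parquet")` (a made-up file name) would read
only the footer bytes, not the row data. Whether a `_metadata` summary file
follows the same layout and carries a meaningful `num_rows` should be
checked against the writer that produced it.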


On Jul 26, 2014, at 12:16 AM, Brandon Amos wrote:

> Hi Parquet team,
> 
> I apologize for the simple question, but I'm using Parquet on HDFS in
> a Scala/Spark application and am having trouble efficiently
> obtaining the number of rows in my Parquet data stores without
> loading and counting.
> 
> The README at https://github.com/apache/incubator-parquet-format
> has great information about the format of the metadata,
> and I want to extract the `num_rows` field from the
> `FileMetaData` Thrift object.
> However, the `_metadata` file contained in Parquet databases
> contains many Thrift objects and other information
> in addition to the `FileMetaData` object that I want to extract.
> 
> Can anybody give recommendations on how I can most efficiently
> extract the `num_rows` field?
> 
> Thanks,
> Brandon.
