Re: [parquet-dev] Efficiently obtaining the number of rows of a Parquet data store.

Brandon Amos Tue, 29 Jul 2014 09:17:13 -0700

Hi Julien,

Is the code snippet below the best way to read the total number of rows in my 
Parquet store?


Thanks,
Brandon.

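          // Assumes scala.collection.JavaConversions._ is in scope, since
          // readFooters and getBlocks both return java.util.List.
          val metadataFooters: Seq[Footer] = ParquetFileReader
            .readFooters(conf, new Path(hdfsPath))
          // Sum the row counts of every row group (block) in every footer.
          val rows = metadataFooters.foldLeft(0L) {
            (res: Long, footer: Footer) =>
              res + footer.getParquetMetadata.getBlocks.foldLeft(0L) {
                (innerRes: Long, block: BlockMetaData) =>
                  innerRes + block.getRowCount
              }
          }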


On Jul 26, 2014, at 7:08 PM, Brandon Amos wrote:

Hi Julien,

Thanks for the response.
Can you give further details on how I can go from having a byte array of the
metadata file to extracting the `num_rows` field?

The Thrift schema at [1] doesn't provide the schema of the entire metadata
file, and I can't get any of the Util functions in [2] to read the metadata 
file either.
The `readFileMetaData` function outputs the following error when I try
passing an InputStream of `_metadata`.

java.io.IOException: can not read class parquet.format.FileMetaData: Required 
field 'version' was not found in serialized data! Struct: 
FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)
at parquet.format.Util.read(Util.java:50)
at parquet.format.Util.readFileMetaData(Util.java:34)
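
One possible explanation for this error, offered as a sketch rather than a
confirmed diagnosis: a Parquet file starts with the 4-byte magic `PAR1`, and
the serialized `FileMetaData` sits at the end of the file, followed by a 4-byte
little-endian footer length and a trailing `PAR1`. A stream positioned at the
start of the file would therefore not begin with the Thrift struct. The
snippet below assumes `_metadata` follows that same layout and that its bytes
are already in memory; `FooterLocator` and `locateFooter` are hypothetical
names used only for illustration.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper: finds the serialized FileMetaData inside the raw
// bytes of a Parquet file, assuming the standard footer layout.
object FooterLocator {
  private val Magic: Array[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII)

  /** Returns (offset, length) of the serialized footer within `bytes`. */
  def locateFooter(bytes: Array[Byte]): (Int, Int) = {
    // Smallest possible file: leading magic + 4-byte length + trailing magic.
    require(bytes.length >= 12, "too short to be a Parquet file")
    val trailingMagic = bytes.slice(bytes.length - 4, bytes.length)
    require(trailingMagic.sameElements(Magic), "missing trailing PAR1 magic")
    // The footer length is a 4-byte little-endian int just before the magic.
    val footerLength = ByteBuffer
      .wrap(bytes, bytes.length - 8, 4)
      .order(ByteOrder.LITTLE_ENDIAN)
      .getInt
    val footerOffset = bytes.length - 8 - footerLength
    require(footerLength >= 0 && footerOffset >= 4, "corrupt footer length")
    (footerOffset, footerLength)
  }
}
```

The slice `bytes.slice(offset, offset + length)` could then be wrapped in a
`java.io.ByteArrayInputStream` and passed to `readFileMetaData`, after which
`num_rows` would be read from the returned object.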

Thanks,
Brandon.

[1]: 
https://github.com/apache/incubator-parquet-format/blob/master/src/thrift/parquet.thrift
[2]: 
https://github.com/apache/incubator-parquet-format/blob/master/src/main/java/parquet/format/Util.java

On Jul 26, 2014, at 6:50 PM, JULIEN LE DEM wrote:

Hi Brandon,
You could probably make a copy of the Thrift definition and keep only the 
fields you need.
If you use the generated classes to read the metadata, Thrift will skip all the 
other fields.
Julien

On Jul 26, 2014, at 12:16 AM, Brandon Amos wrote:

Hi Parquet team,

I apologize for the simple question, but I'm using Parquet on HDFS in
a Scala/Spark application and am having trouble efficiently
obtaining the number of rows in my Parquet data stores without
loading and counting.

The README at https://github.com/apache/incubator-parquet-format
has great information about the format of the metadata,
and I want to extract the `num_rows` field from the
`FileMetaData` Thrift object.
However, the `_metadata` file contained in Parquet databases
contains many Thrift objects and other information
in addition to the `FileMetaData` object that I want to extract.

Can anybody give recommendations on how I can most efficiently
extract the `num_rows` field?

Thanks,
Brandon.
