This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 357eb6b  ARROW-7971: [Rust] Create rowcount utility
357eb6b is described below

commit 357eb6bb79e6ed1a196b7101f7c3e7b425a08f70
Author: Ken Suenobu <[email protected]>
AuthorDate: Mon Mar 2 11:02:22 2020 +0100

    ARROW-7971: [Rust] Create rowcount utility
    
    This utility introduces a way to count the number of rows present in one or
    more Parquet files.  This was tested against the Parquet payloads in the
    `python/pyarrow/tests/data/parquet/` directory.  I created this utility out of
    necessity, as the `parquet-tools` project introduces applications written using
    Java.  This is a much faster alternative, and allows for multiple files to be
    counted at a time, rather than `parquet-tools` ability to only count one file
    at a time.
    
    Closes #6511 from KenSuenobu/parquet-rowcount-rust and squashes the following commits:
    
    76ce2d27e <Ken Suenobu> Renamed variable.
    794237e46 <Ken Suenobu> Changed code to use row_groups() from metadata as suggested by @andygrove
    31ebd3627 <Ken Suenobu> Removed formatted space.
    819df7a58 <Ken Suenobu> Creation of parquet-rowcount tool to help count number of rows in a Parquet file.
    
    Authored-by: Ken Suenobu <[email protected]>
    Signed-off-by: Krisztián Szűcs <[email protected]>
---
 rust/parquet/README.md                   |  4 ++
 rust/parquet/src/bin/parquet-rowcount.rs | 74 ++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/rust/parquet/README.md b/rust/parquet/README.md
index ea62e4a..aed5cec 100644
--- a/rust/parquet/README.md
+++ b/rust/parquet/README.md
@@ -97,6 +97,10 @@ and optional `verbose` is the boolean flag that allows to print full metadata or
 and `num-records` is the number of records to read from a file (when not specified all records will
 be printed).
 
+- **parquet-rowcount** for reporting the number of records in one or more Parquet files.
+`Usage: parquet-rowcount <file-path> ...`, where `file-path` is the path to a Parquet file, and `...`
+indicates any number of additional parquet files.
+
 If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly:
 ```
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib
diff --git a/rust/parquet/src/bin/parquet-rowcount.rs b/rust/parquet/src/bin/parquet-rowcount.rs
new file mode 100644
index 0000000..a51e587
--- /dev/null
+++ b/rust/parquet/src/bin/parquet-rowcount.rs
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Binary file to return the number of rows found from Parquet file(s).
+//!
+//! # Install
+//!
+//! `parquet-rowcount` can be installed using `cargo`:
+//! ```
+//! cargo install parquet
+//! ```
+//! After this `parquet-rowcount` should be globally available:
+//! ```
+//! parquet-rowcount XYZ.parquet
+//! ```
+//!
+//! The binary can also be built from the source code and run as follows:
+//! ```
+//! cargo run --bin parquet-rowcount XYZ.parquet ABC.parquet ZXC.parquet
+//! ```
+//!
+//! # Usage
+//!
+//! ```
+//! parquet-rowcount <file-path> ...
+//! ```
+//! where `file-path` is the path to a Parquet file and `...` is any additional number of
+//! parquet files to count the number of rows from.
+//!
+//! Note that `parquet-rowcount` reads full file schema, no projection or filtering is
+//! applied.
+
+extern crate parquet;
+
+use std::{env, fs::File, path::Path, process};
+
+use parquet::file::reader::{FileReader, SerializedFileReader};
+
+fn main() {
+    let args: Vec<String> = env::args().collect();
+    if args.len() < 2 {
+        println!("Usage: parquet-rowcount <file-path> ...");
+        process::exit(1);
+    }
+
+    for i in 1..args.len() {
+        let filename = args[i].clone();
+        let path = Path::new(&filename);
+        let file = File::open(&path).unwrap();
+        let parquet_reader = SerializedFileReader::new(file).unwrap();
+        let row_group_metadata = parquet_reader.metadata().row_groups();
+        let mut total_num_rows = 0;
+
+        for group_metadata in row_group_metadata {
+            total_num_rows += group_metadata.num_rows();
+        }
+
+        eprintln!("File {}: rowcount={}", filename, total_num_rows);
+    }
+}
