This is an automated email from the ASF dual-hosted git repository.
kszucs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 357eb6b ARROW-7971: [Rust] Create rowcount utility
357eb6b is described below
commit 357eb6bb79e6ed1a196b7101f7c3e7b425a08f70
Author: Ken Suenobu <[email protected]>
AuthorDate: Mon Mar 2 11:02:22 2020 +0100
ARROW-7971: [Rust] Create rowcount utility
This utility counts the number of rows in one or more Parquet files. It was
tested against the Parquet payloads in the
`python/pyarrow/tests/data/parquet/` directory. I created this utility out of
necessity, as the `parquet-tools` project provides applications written in
Java. This is a much faster alternative and can count multiple files in a
single invocation, whereas `parquet-tools` can only count one file at a time.
Closes #6511 from KenSuenobu/parquet-rowcount-rust and squashes the
following commits:
76ce2d27e <Ken Suenobu> Renamed variable.
794237e46 <Ken Suenobu> Changed code to use row_groups() from metadata as
suggested by @andygrove
31ebd3627 <Ken Suenobu> Removed formatted space.
819df7a58 <Ken Suenobu> Creation of parquet-rowcount tool to help count
number of rows in a Parquet file.
Authored-by: Ken Suenobu <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
---
rust/parquet/README.md | 4 ++
rust/parquet/src/bin/parquet-rowcount.rs | 74 ++++++++++++++++++++++++++++++++
2 files changed, 78 insertions(+)
diff --git a/rust/parquet/README.md b/rust/parquet/README.md
index ea62e4a..aed5cec 100644
--- a/rust/parquet/README.md
+++ b/rust/parquet/README.md
@@ -97,6 +97,10 @@ and optional `verbose` is the boolean flag that allows to print full metadata or
and `num-records` is the number of records to read from a file (when not specified all records will
be printed).
+- **parquet-rowcount** for reporting the number of records in one or more Parquet files.
+`Usage: parquet-rowcount <file-path> ...`, where `file-path` is the path to a Parquet file, and `...`
+indicates any number of additional parquet files.
+
If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly:
```
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib
diff --git a/rust/parquet/src/bin/parquet-rowcount.rs b/rust/parquet/src/bin/parquet-rowcount.rs
new file mode 100644
index 0000000..a51e587
--- /dev/null
+++ b/rust/parquet/src/bin/parquet-rowcount.rs
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Binary file to return the number of rows found from Parquet file(s).
+//!
+//! # Install
+//!
+//! `parquet-rowcount` can be installed using `cargo`:
+//! ```
+//! cargo install parquet
+//! ```
+//! After this `parquet-rowcount` should be globally available:
+//! ```
+//! parquet-rowcount XYZ.parquet
+//! ```
+//!
+//! The binary can also be built from the source code and run as follows:
+//! ```
+//! cargo run --bin parquet-rowcount XYZ.parquet ABC.parquet ZXC.parquet
+//! ```
+//!
+//! # Usage
+//!
+//! ```
+//! parquet-rowcount <file-path> ...
+//! ```
+//! where `file-path` is the path to a Parquet file and `...` is any additional number of
+//! parquet files to count the number of rows from.
+//!
+//! Note that `parquet-rowcount` reads full file schema, no projection or filtering is
+//! applied.
+
+extern crate parquet;
+
+use std::{env, fs::File, path::Path, process};
+
+use parquet::file::reader::{FileReader, SerializedFileReader};
+
+fn main() {
+ let args: Vec<String> = env::args().collect();
+ if args.len() < 2 {
+ println!("Usage: parquet-rowcount <file-path> ...");
+ process::exit(1);
+ }
+
+ for i in 1..args.len() {
+ let filename = args[i].clone();
+ let path = Path::new(&filename);
+ let file = File::open(&path).unwrap();
+ let parquet_reader = SerializedFileReader::new(file).unwrap();
+ let row_group_metadata = parquet_reader.metadata().row_groups();
+ let mut total_num_rows = 0;
+
+ for group_metadata in row_group_metadata {
+ total_num_rows += group_metadata.num_rows();
+ }
+
+ eprintln!("File {}: rowcount={}", filename, total_num_rows);
+ }
+}
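
For readers who want to follow the aggregation without pulling in the `parquet` crate, the per-file loop above can be sketched with the standard library only. This is an illustration, not the committed code: `RowGroup` and `count_rows` are hypothetical stand-ins for the crate's row-group metadata and for the summing loop in `main`, and the row counts are made-up sample values. The sketch prints its result to stdout, whereas the committed binary writes the result with `eprintln!`, which goes to stderr.

```rust
/// Hypothetical stand-in for the row-group metadata returned by
/// `parquet_reader.metadata().row_groups()` in the committed code.
struct RowGroup {
    num_rows: i64,
}

impl RowGroup {
    /// Number of rows recorded for this row group.
    fn num_rows(&self) -> i64 {
        self.num_rows
    }
}

/// Hypothetical helper: sum the row counts of every row group in one file.
/// This mirrors the `for group_metadata in row_group_metadata` loop.
fn count_rows(row_groups: &[RowGroup]) -> i64 {
    row_groups.iter().map(|group| group.num_rows()).sum()
}

fn main() {
    // Made-up sample data: one "file" consisting of three row groups.
    let groups = vec![
        RowGroup { num_rows: 1000 },
        RowGroup { num_rows: 1000 },
        RowGroup { num_rows: 250 },
    ];

    println!("File sample.parquet: rowcount={}", count_rows(&groups));
}
```

Because only the row-group metadata is summed, the count does not depend on decoding any column data; a file with no row groups simply yields zero.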