alamb commented on a change in pull request #8992:
URL: https://github.com/apache/arrow/pull/8992#discussion_r547952117
##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -30,17 +30,17 @@ use crate::physical_plan::{common, Partitioning};
use arrow::datatypes::{Schema, SchemaRef};
use arrow::error::{ArrowError, Result as ArrowResult};
use arrow::record_batch::RecordBatch;
-use parquet::file::metadata::ParquetMetaData;
use parquet::file::reader::SerializedFileReader;
use crossbeam::channel::{bounded, Receiver, RecvError, Sender};
use fmt::Debug;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
+use crate::datasource::datasource::Statistics;
use async_trait::async_trait;
use futures::stream::Stream;
-/// Execution plan for scanning a Parquet file
+/// Execution plan for scanning one or more Parquet files
Review comment:
👍
##########
File path: rust/datafusion/src/physical_plan/parquet.rs
##########
@@ -67,14 +67,35 @@ impl ParquetExec {
         if filenames.is_empty() {
             Err(DataFusionError::Plan("No files found".to_string()))
         } else {
+            // Calculate statistics for the entire data set. Later, we will probably
+            // want to make statistics available on a per-partition basis.
+            let mut num_rows = 0;
+            let mut total_byte_size = 0;
+            for file in &filenames {
+                let file = File::open(file)?;
+                let file_reader = Arc::new(SerializedFileReader::new(file)?);
Review comment:
It probably doesn't matter, but we are creating readers for the same file several times -- here we create a `SerializedFileReader` just to read the metadata, then right below we (re)open the first file again to read the schema, and then we open the files once more to actually read the data...
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]