alamb commented on code in PR #10813:
URL: https://github.com/apache/datafusion/pull/10813#discussion_r1630060579


##########
datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs:
##########
@@ -182,6 +183,11 @@ impl ParquetAccessPlan {
     /// is returned for *all* the rows in the row groups that are not skipped.
     /// Thus it includes a `Select` selection for any [`RowGroupAccess::Scan`].
     ///
+    /// # Errors

Review Comment:
   Since users can now provide a `ParquetAccessPlan`, it is important to validate its contents.

   While technically we could skip this validation when the selections come from page pruning, I think it is a good check to have, catching future bugs rather than producing subtly wrong results, so I chose to always validate.
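
   As a sketch of the kind of check this enables (the function name and error text below are illustrative, not the PR's exact API): a per-row-group selection is only consistent if its selectors cover exactly the number of rows in that row group.

   ```rust
   use datafusion::error::{DataFusionError, Result};
   use parquet::arrow::arrow_reader::RowSelector;

   /// Illustrative validation: the selectors for one row group must cover
   /// exactly `rows_in_row_group` rows (selected + skipped); otherwise the
   /// provided plan is inconsistent with the file metadata.
   fn validate_selection(
       row_group_index: usize,
       selectors: &[RowSelector],
       rows_in_row_group: usize,
   ) -> Result<()> {
       let covered: usize = selectors.iter().map(|s| s.row_count).sum();
       if covered != rows_in_row_group {
           return Err(DataFusionError::Execution(format!(
               "Invalid selection for row group {row_group_index}: \
                selectors cover {covered} rows but the row group has {rows_in_row_group} rows"
           )));
       }
       Ok(())
   }
   ```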



##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -303,25 +307,8 @@ impl ContextWithParquet {
             .expect("Running");
 
         // find the parquet metrics
-        struct MetricsFinder {
-            metrics: Option<MetricsSet>,
-        }
-        impl ExecutionPlanVisitor for MetricsFinder {
-            type Error = std::convert::Infallible;
-            fn pre_visit(
-                &mut self,
-                plan: &dyn ExecutionPlan,
-            ) -> Result<bool, Self::Error> {
-                if plan.as_any().downcast_ref::<ParquetExec>().is_some() {
-                    self.metrics = plan.metrics();
-                }
-                // stop searching once we have found the metrics
-                Ok(self.metrics.is_none())
-            }
-        }
-        let mut finder = MetricsFinder { metrics: None };
-        accept(physical_plan.as_ref(), &mut finder).unwrap();
-        let parquet_metrics = finder.metrics.unwrap();
+        let parquet_metrics =

Review Comment:
   Pulled into a new file so I could reuse it
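
   For reference, a minimal sketch of what the reusable helper might look like (the function name is an assumption; the body mirrors the visitor removed above):

   ```rust
   use std::convert::Infallible;

   use datafusion::datasource::physical_plan::ParquetExec;
   use datafusion::physical_plan::metrics::MetricsSet;
   use datafusion::physical_plan::{accept, ExecutionPlan, ExecutionPlanVisitor};

   /// Walk the plan and return the metrics of the first `ParquetExec` found.
   pub fn find_parquet_metrics(plan: &dyn ExecutionPlan) -> Option<MetricsSet> {
       struct MetricsFinder {
           metrics: Option<MetricsSet>,
       }
       impl ExecutionPlanVisitor for MetricsFinder {
           type Error = Infallible;
           fn pre_visit(&mut self, plan: &dyn ExecutionPlan) -> Result<bool, Self::Error> {
               if plan.as_any().downcast_ref::<ParquetExec>().is_some() {
                   self.metrics = plan.metrics();
               }
               // stop searching once we have found the metrics
               Ok(self.metrics.is_none())
           }
       }
       let mut finder = MetricsFinder { metrics: None };
       accept(plan, &mut finder).expect("visitor is infallible");
       finder.metrics
   }
   ```

   The truncated `let parquet_metrics =` line above would then call this helper (whatever the new file actually names it).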



##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -145,6 +145,52 @@ pub use writer::plan_to_parquet;
 /// custom reader is used, it supplies the metadata directly and this parameter
 /// is ignored. See [`ParquetExecBuilder::with_metadata_size_hint`] for more details.
 ///
+/// * User-provided [`ParquetAccessPlan`]s to skip row groups and/or pages
+/// based on external information. See "Implementing External Indexes" below
+///
+/// # Implementing External Indexes
+///
+/// It is possible to restrict the row groups and selections within those row
+/// groups that the ParquetExec will consider by providing an initial
+/// [`ParquetAccessPlan`] as `extensions` on [`PartitionedFile`]. This can be
+/// used to implement external indexes on top of parquet files and select only
+/// portions of the files.
+///
+/// The `ParquetExec` will try to further reduce any provided
+/// `ParquetAccessPlan` based on the contents of `ParquetMetadata` and
+/// other settings.
+///
+/// ## Example of providing a ParquetAccessPlan
+///
+/// ```
+/// # use std::sync::Arc;
+/// # use arrow_schema::{Schema, SchemaRef};
+/// # use datafusion::datasource::listing::PartitionedFile;
+/// # use datafusion::datasource::physical_plan::parquet::ParquetAccessPlan;
+/// # use datafusion::datasource::physical_plan::{FileScanConfig, ParquetExec};
+/// # use datafusion_execution::object_store::ObjectStoreUrl;
+/// # fn schema() -> SchemaRef {
+/// #   Arc::new(Schema::empty())
+/// # }
+/// // create an access plan to scan row groups 0, 1 and 3 and skip row groups 2 and 4
+/// let mut access_plan = ParquetAccessPlan::new_all(5);
+/// access_plan.skip(2);
+/// access_plan.skip(4);
+/// // provide the plan as extension to the FileScanConfig
+/// let partitioned_file = PartitionedFile::new("my_file.parquet", 1234)
+///   .with_extensions(Arc::new(access_plan));
+/// // create a ParquetExec to scan this file
+/// let file_scan_config = FileScanConfig::new(ObjectStoreUrl::local_filesystem(), schema())
+///     .with_file(partitioned_file);
+/// // this parquet exec will not even try to read row groups 2 and 4. Additional
+/// // pruning based on predicates may also happen
+/// let exec = ParquetExec::builder(file_scan_config).build();
+/// ```
+///
+/// For a complete example, see the [`parquet_index_advanced` example].
+///
+/// [`parquet_index_advanced` example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index_advanced.rs

Review Comment:
   This will be added in https://github.com/apache/datafusion/pull/10701



##########
datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs:
##########
@@ -366,7 +410,8 @@ mod test {
                     RowSelector::select(10),
                     // selectors from the second row group
                     RowSelector::select(5),
-                    RowSelector::skip(7)
+                    RowSelector::skip(7),

Review Comment:
   Turns out that some of the existing unit tests were actually invalid -- all of the actual parquet reader tests were good, however.
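
   Illustrative only (the second row group's 12-row count is an assumption): the overall `RowSelection` concatenates the per-row-group selectors, so a test's expected value has to account for every row of every scanned row group.

   ```rust
   use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

   fn main() {
       let selectors = vec![
           // first row group: 10 rows, all selected
           RowSelector::select(10),
           // second row group (assumed 12 rows): select 5, skip the remaining 7
           RowSelector::select(5),
           RowSelector::skip(7),
       ];
       // together the selectors must cover all 10 + 12 rows
       let covered: usize = selectors.iter().map(|s| s.row_count).sum();
       assert_eq!(covered, 22);
       let _selection: RowSelection = selectors.into();
   }
   ```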



##########
datafusion/core/src/datasource/physical_plan/parquet/opener.rs:
##########
@@ -139,7 +139,8 @@ impl FileOpener for ParquetOpener {
             let predicate = pruning_predicate.as_ref().map(|p| p.as_ref());
             let rg_metadata = file_metadata.row_groups();
             // track which row groups to actually read
-            let access_plan = ParquetAccessPlan::new_all(rg_metadata.len());
+            let access_plan =

Review Comment:
   This is the plumbing, which I am quite pleased with -- it is quite straightforward now.
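
   Roughly, the plumbing boils down to something like the sketch below (the helper name is illustrative; it assumes the `ParquetAccessPlan` is attached as an `Arc`ed extension and is `Clone`): start from a user-provided plan when one is present, otherwise fall back to scanning all row groups.

   ```rust
   use std::any::Any;
   use std::sync::Arc;

   use datafusion::datasource::physical_plan::parquet::ParquetAccessPlan;

   /// Hypothetical helper: use a `ParquetAccessPlan` provided via the file's
   /// `extensions`, falling back to scanning every row group.
   fn initial_access_plan(
       extensions: Option<Arc<dyn Any + Send + Sync>>,
       row_group_count: usize,
   ) -> ParquetAccessPlan {
       extensions
           .as_ref()
           .and_then(|ext| ext.downcast_ref::<ParquetAccessPlan>())
           .cloned()
           .unwrap_or_else(|| ParquetAccessPlan::new_all(row_group_count))
   }
   ```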


