adriangb commented on code in PR #20188:
URL: https://github.com/apache/datafusion/pull/20188#discussion_r2774758388
##########
datafusion/datasource/src/file_scan_config.rs:
##########
@@ -55,10 +55,21 @@ use datafusion_physical_plan::{
use log::{debug, warn};
use std::{any::Any, fmt::Debug, fmt::Formatter, fmt::Result as FmtResult,
sync::Arc};
-/// The base configurations for a [`DataSourceExec`], the a physical plan for
-/// any given file format.
+/// [`FileScanConfig`] represents scanning data from a group of files
///
-/// Use [`DataSourceExec::from_data_source`] to create a [`DataSourceExec`]
from a ``FileScanConfig`.
+/// `FileScanConfig` is used to create a [`DataSourceExec`], the physical plan
+/// for scanning files with a particular file format.
+///
+/// The [`FileSource`] (e.g. `ParquetSource`, `CsvSource`, etc.) is responsible
+/// for creating the actual execution plan to read the files based on a
+/// `FileScanConfig`. Fields in a `FileScanConfig` such as Statistics represent
+/// information about the files **before** any projection or filtering is
Review Comment:
I think this is good. I still get confused about stats before vs. after
projection / filtering. I do have some caveats: for us we use an external index
to prune row groups, so we can actually get _more_ accurate statistics for
min/max, num rows, etc. in a file taking into account row groups pruned. It's
also useful to project the statistics (or at least only populate them for
columns that are in filters or projections); no point in collecting /
allocating statistics for columns that will never be used.
##########
datafusion/datasource/src/file_scan_config.rs:
##########
@@ -332,7 +350,12 @@ impl FileScanConfigBuilder {
/// Set the columns on which to project the data using column indices.
///
- /// Indexes that are higher than the number of columns of `file_schema`
refer to `table_partition_cols`.
+ /// This method attempts to push down the projection to the underlying file
+ /// source if supported. If the file source does not support projection
+ /// pushdown, an error is returned.
Review Comment:
I think we return `Ok(self)` -> not an error, an unchanged plan?
##########
datafusion/datasource/src/file.rs:
##########
@@ -64,24 +70,38 @@ pub trait FileSource: Send + Sync {
) -> Result<Arc<dyn FileOpener>>;
/// Any
fn as_any(&self) -> &dyn Any;
+
/// Returns the table schema for this file source.
///
- /// This always returns the unprojected schema (the full schema of the
data).
+ /// This always returns the unprojected schema (the full schema of the
data)
+ /// without [`Self::projection`] applied.
+ ///
+ /// The output schema of this `FileSource` is this TableSchema
+ /// with [`Self::projection`] applied.
Review Comment:
Wonder if we should add a helper method to `FileSource` that applies the
projection?
##########
datafusion/datasource/src/file.rs:
##########
@@ -135,6 +155,10 @@ pub trait FileSource: Send + Sync {
}
/// Try to push down filters into this FileSource.
+ ///
+ /// `filters` must be in terms of the unprojected table schema (file schema
+ /// plus partition columns).
Review Comment:
👍🏻
Worth clarifying what is expected for a FileSource that:
- Applies filters independently of projections
- Applies filters after reading the data / projecting
?
##########
datafusion/datasource/src/file.rs:
##########
@@ -46,6 +46,12 @@ pub fn as_file_source<T: FileSource + 'static>(source: T) ->
Arc<dyn FileSource>
/// file format specific behaviors for elements in [`DataSource`]
///
+/// # Schema information
+/// There are two important schemas for a [`FileSource`]:
+/// 1. [`Self::table_schema`] -- the schema for the overall "table"
Review Comment:
Does this include partition columns?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]