toutane commented on code in PR #2298:
URL: https://github.com/apache/iceberg-rust/pull/2298#discussion_r3421035832
##########
crates/integrations/datafusion/src/table/mod.rs:
##########
@@ -124,26 +128,93 @@ impl TableProvider for IcebergTableProvider {
async fn scan(
&self,
- _state: &dyn Session,
+ state: &dyn Session,
projection: Option<&Vec<usize>>,
filters: &[Expr],
limit: Option<usize>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
- // Load fresh table metadata from catalog
+ // Second load: fetch the latest snapshot so scans always reflect
current table state.
let table = self
.catalog
.load_table(&self.table_ident)
.await
.map_err(to_datafusion_error)?;
- // Create scan with fresh metadata (always use current snapshot)
- Ok(Arc::new(IcebergTableScan::new(
+ // Build a TableScan mirroring the inputs we'll hand to
IcebergTableScan,
+ // so plan_files() uses the same projection/filters the scan will
replay in execute().
+ let col_names = projection.map(|indices| {
+ indices
+ .iter()
+ .map(|&i| self.schema.field(i).name().clone())
+ .collect::<Vec<_>>()
+ });
+
+ let predicate = convert_filters_to_predicate(filters);
+
+ let mut builder = table.scan();
+ builder = match col_names {
+ Some(names) => builder.select(names),
+ None => builder.select_all(),
+ };
+ if let Some(pred) = predicate {
+ builder = builder.with_filter(pred);
+ }
+
+ let tasks: Vec<FileScanTask> = builder
+ .build()
+ .map_err(to_datafusion_error)?
+ .plan_files()
Review Comment:
Yes, repeated `scan()` calls currently re-plan. The eager plan is cached
only inside the returned `IcebergTableScan`, not across separate provider
`scan()` calls.
I agree caching planned tasks is the right follow-up. I’d prefer to keep
this PR scoped to moving planning/bucketing into the existing provider and
declaring partitioning correctly, then add a follow-up cache keyed by concrete
snapshot id, projection, filter, and target partitions. We should still reload
the table first so snapshot changes naturally miss the cache and preserve the
catalog-backed provider’s “latest snapshot per scan” behavior.
I’ll track this as follow-up work unless you think it should block this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]