FrankChen021 commented on code in PR #19449:
URL: https://github.com/apache/druid/pull/19449#discussion_r3226691456
##########
extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergCatalog.java:
##########
@@ -88,15 +92,27 @@ public List<String> extractSnapshotDataFiles(
try {
Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
TableIdentifier icebergTableIdentifier =
catalog.listTables(namespace).stream()
- .filter(tableId -> tableId.toString().equals(tableIdentifier))
- .findFirst()
- .orElseThrow(() -> new IAE(
- " Couldn't retrieve table identifier for '%s'. Please verify that the table exists in the given catalog",
- tableIdentifier
- ));
+ .filter(tableId -> tableId.toString().equals(tableIdentifier))
+ .findFirst()
+ .orElseThrow(() -> new IAE(
+ " Couldn't retrieve table identifier for '%s'. Please verify that the table exists in the given catalog",
+ tableIdentifier
+ ));
long start = System.currentTimeMillis();
- TableScan tableScan = catalog.loadTable(icebergTableIdentifier).newScan();
+ Table table = catalog.loadTable(icebergTableIdentifier);
+ TableScan tableScan = table.newScan();
+
+ if (columnsFilter != null) {
+ List<String> projectedColumns = table
+ .schema()
+ .columns()
+ .stream()
+ .map(Types.NestedField::name)
+ .filter(columnsFilter::apply)
+ .collect(Collectors.toList());
+ tableScan = tableScan.select(new ArrayList<>(projectedColumns));
Review Comment:
[P2] Projection is discarded before data is read
This selects projected columns on the Iceberg TableScan, but the method only
returns task.file().location() afterward. The projected FileScanTask schema is
discarded, and IcebergInputSource builds the warehouse delegate from the same
raw file paths, so Druid's Parquet reader still opens the original files
without the Iceberg projection. The new test also manually projects with
Parquet.read(...).project(...), so it would pass even if this select had no
effect. To make column projection work, either the projected schema/split
information needs to be carried into the reader path, or pruning needs to
happen in the delegate input format.
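
To illustrate the concern, here is a minimal, self-contained sketch. It does
not use the Iceberg or Druid APIs: `FileTask`, `planFiles`, and `extractPaths`
are hypothetical stand-ins for a planned scan task, scan planning, and the
path-only extraction this method performs. Under those assumptions, it shows
that the list of paths handed downstream is identical with or without a column
selection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of the problem; none of these names come from Iceberg or Druid.
class ProjectionSketch {

  // Stand-in for a planned scan task: a data file plus the columns the scan selected.
  static final class FileTask {
    final String location;
    final List<String> projectedColumns;

    FileTask(String location, List<String> projectedColumns) {
      this.location = location;
      this.projectedColumns = projectedColumns;
    }
  }

  // Stand-in for scan planning: every file task carries the same column selection.
  static List<FileTask> planFiles(List<String> files, List<String> selectedColumns) {
    List<FileTask> tasks = new ArrayList<>();
    for (String file : files) {
      tasks.add(new FileTask(file, selectedColumns));
    }
    return tasks;
  }

  // The pattern described above: only the path survives, the projection is dropped.
  static List<String> extractPaths(List<FileTask> tasks) {
    return tasks.stream().map(task -> task.location).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("s3://bucket/a.parquet", "s3://bucket/b.parquet");

    List<String> withoutSelect =
        extractPaths(planFiles(files, Arrays.asList("id", "name", "payload")));
    List<String> withSelect =
        extractPaths(planFiles(files, Arrays.asList("id")));

    // A reader that only receives these paths cannot tell the two scans apart.
    System.out.println(withoutSelect.equals(withSelect));
  }
}
```

A test that asserts on what the reader actually opens (for example, the
columns present in the rows it returns), rather than projecting manually,
would be able to tell these two cases apart.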
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]