rdblue commented on code in PR #5268:
URL: https://github.com/apache/iceberg/pull/5268#discussion_r922363887


##########
core/src/main/java/org/apache/iceberg/BaseTableScan.java:
##########
@@ -116,9 +133,20 @@ public CloseableIterable<FileScanTask> planFiles() {
           ExpressionUtil.toSanitizedString(filter()));
 
       Listeners.notifyAll(new ScanEvent(table().name(), snapshot.snapshotId(), 
filter(), schema()));
-
-      return doPlanFiles();
-
+      scanReporter().startScan();
+      Timer.Sample sample = 
scanReporter().scanMetrics().totalScanDuration().start();

Review Comment:
   Tying the lifecycle of scan metrics to the scan reporter means that all of 
the scans created through this chain will have the same metrics collector, 
which is a correctness problem.
   
   In the Java library, the `TableScan` interface uses a refinement pattern. 
Each `TableScan` is independent and you get a new table scan that has 
additional options when you call the refinement methods, like `select` or 
`filter`. For example,
   
   ```java
   TableScan fullTableScan = table.newScan();
   TableScan yesterday = fullTableScan.filter(Expressions.equal("date_col", 
yesterday()));
   TableScan loginsYesterday = yesterday.filter(Expressions.equal("event_type", 
"login"));
   
   CloseableIterable<FileScanTask> yesterdayTasks = yesterday.planFiles();
   CloseableIterable<FileScanTask> loginTasks = loginsYesterday.planFiles();
   ```
   
   Two scans are planned, but the metrics for both would use the same 
`ScanMetrics`.
   
   Above, I suggested separating `ScanMetrics` from the reporter, which would 
solve this problem. Not all scans are planned, so you wouldn't want to eagerly 
create metrics, but the ones that are used should have a new one:
   
   ```java
     protected ScanMetrics metrics() {
       if (scanMetrics == null) {
         this.scanMetrics = scanReporter.newScanMetrics();
       }
   
       return scanMetrics;
     }
   
     public CloseableIterable<FileScanTask> planFiles(...) {
       ScanMetrics metrics = metrics();
       ...
     }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to