rdblue commented on a change in pull request #735: Refactor FilteredManifest 
and ManifestGroup
URL: https://github.com/apache/incubator-iceberg/pull/735#discussion_r370291337
 
 

 ##########
 File path: core/src/main/java/org/apache/iceberg/FilteredManifest.java
 ##########
 @@ -95,60 +94,38 @@ public FilteredManifest caseSensitive(boolean 
isCaseSensitive) {
       Evaluator evaluator = evaluator();
       InclusiveMetricsEvaluator metricsEvaluator = metricsEvaluator();
 
-      return CloseableIterable.filter(reader.entries(projection(fileSchema, 
columns, caseSensitive)),
+      // ensure stats columns are present for metrics evaluation
+      boolean requireStatsProjection = requireStatsProjection();
+      Collection<String> projectColumns = requireStatsProjection ? 
withStatsColumns(columns) : columns;
+
+      CloseableIterable<ManifestEntry> entries = CloseableIterable.filter(
+          reader.entries(projection(fileSchema, projectColumns, 
caseSensitive)),
           entry -> entry != null &&
               evaluator.eval(entry.file().partition()) &&
               metricsEvaluator.eval(entry.file()));
 
+      // note that columns itself could have stats projected, that's ok.
+      // We only drop stats if we have forced stats projection.
+      return CloseableIterable.transform(entries, e -> requireStatsProjection 
? e.copyWithoutStats() : e);
 
 Review comment:
   For performance, I think it is worse to copy here than to return the extra 
columns. We don't copy because we assume that the caller will use the row 
immediately -- e.g. to aggregate stats -- or will make a defensive copy. The 
caller is responsible to decide. That really cuts down on object allocation and 
helps performance. Copying just to get rid of stats columns forces a copy, when 
the caller may not need one.
   
   Let's just return what was read here. In `iterator`, where we are copying 
anyway we can do the copy with or without stats.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to