RussellSpitzer commented on code in PR #15426:
URL: https://github.com/apache/iceberg/pull/15426#discussion_r2848051519


##########
core/src/main/java/org/apache/iceberg/ManifestGroup.java:
##########
@@ -222,13 +222,23 @@ public <T extends ScanTask> CloseableIterable<T> 
plan(CreateTasksFunction<T> cre
   /**
    * Returns an iterable for manifest entries in the set of manifests.
    *
-   * <p>Entries are not copied and it is the caller's responsibility to make 
defensive copies if
-   * adding these entries to a collection.
+   * <p>When parallel execution is enabled via {@link 
#planWith(ExecutorService)}, entries are
+   * defensively copied since the underlying {@link ManifestReader} reuses 
entry objects during
+   * iteration. Otherwise, entries are not copied and it is the caller's 
responsibility to make
+   * defensive copies if adding these entries to a collection.
    *
    * @return a CloseableIterable of manifest entries.
    */
   public CloseableIterable<ManifestEntry<DataFile>> entries() {
-    return CloseableIterable.concat(entries((manifest, entries) -> entries));
+    if (executorService != null) {
+      // copy entries to avoid object reuse issues when scanning manifests in 
parallel,
+      // as ManifestReader reuses entry objects during iteration
+      Iterable<CloseableIterable<ManifestEntry<DataFile>>> entryIterables =
+          entries((manifest, entries) -> CloseableIterable.transform(entries, 
ManifestEntry::copy));

Review Comment:
   I'm a little worried about introducing a sort of hidden "materialize 
everything" setting. 
   
   We have one internal use where this could come up and we are effectively 
double copying with this change
   
   ## FindFiles
   
   
https://github.com/apache/iceberg/blob/84172252a1f2537549e52b174299cbf715138760/core/src/main/java/org/apache/iceberg/FindFiles.java#L201-L223
   
   So not even counting external users of this library we are already 
introducing a regression. I'm not saying we shouldn't do this but we should be 
very careful here to avoid accidentally doubling memory consumption (or greatly 
increasing consumption) of a downstream consumer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to