RussellSpitzer commented on code in PR #15426:
URL: https://github.com/apache/iceberg/pull/15426#discussion_r2848051519
##########
core/src/main/java/org/apache/iceberg/ManifestGroup.java:
##########
@@ -222,13 +222,23 @@ public <T extends ScanTask> CloseableIterable<T>
plan(CreateTasksFunction<T> cre
/**
* Returns an iterable for manifest entries in the set of manifests.
*
- * <p>Entries are not copied and it is the caller's responsibility to make
defensive copies if
- * adding these entries to a collection.
+ * <p>When parallel execution is enabled via {@link
#planWith(ExecutorService)}, entries are
+ * defensively copied since the underlying {@link ManifestReader} reuses
entry objects during
+ * iteration. Otherwise, entries are not copied and it is the caller's
responsibility to make
+ * defensive copies if adding these entries to a collection.
*
* @return a CloseableIterable of manifest entries.
*/
public CloseableIterable<ManifestEntry<DataFile>> entries() {
- return CloseableIterable.concat(entries((manifest, entries) -> entries));
+ if (executorService != null) {
+ // copy entries to avoid object reuse issues when scanning manifests in
parallel,
+ // as ManifestReader reuses entry objects during iteration
+ Iterable<CloseableIterable<ManifestEntry<DataFile>>> entryIterables =
+ entries((manifest, entries) -> CloseableIterable.transform(entries,
ManifestEntry::copy));
Review Comment:
I'm a little worried about introducing a sort of hidden "materialize
everything" setting.
We have one internal use where this could come up and we are effectively
double copying with this change
## FindFiles
https://github.com/apache/iceberg/blob/84172252a1f2537549e52b174299cbf715138760/core/src/main/java/org/apache/iceberg/FindFiles.java#L201-L223
So not even counting external users of this library we are already
introducing a regression. I'm not saying we shouldn't do this but we should be
very careful here to avoid accidentally doubling memory consumption (or greatly
increasing consumption) of a downstream consumer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]