szehon-ho commented on a change in pull request #3365:
URL: https://github.com/apache/iceberg/pull/3365#discussion_r736041815



##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -267,17 +268,28 @@ public static StructLikeSet actualRowSet(Table table, Long snapshotId, String...
 
   public static Map<Long, List<DataFile>> snapshotToDataFiles(Table table) throws IOException {
     table.refresh();
+
     Map<Long, List<DataFile>> result = Maps.newHashMap();
-    List<ManifestFile> manifestFiles = table.currentSnapshot().dataManifests();
-    for (ManifestFile manifestFile : manifestFiles) {
-      try (ManifestReader<DataFile> reader = ManifestFiles.read(manifestFile, table.io())) {
-        List<DataFile> dataFiles = Lists.newArrayList(reader);
-        if (result.containsKey(manifestFile.snapshotId())) {
-          result.get(manifestFile.snapshotId()).addAll(dataFiles);
-        } else {
-          result.put(manifestFile.snapshotId(), dataFiles);
-        }
+    Snapshot current = table.currentSnapshot();

Review comment:
       I get what you mean.  If you are curious, a real-life use case we had 
for ManifestEntry was a bit similar to this.
   
   We were trying to build a data latency monitoring application that measures 
max data latency per partition.  So we wanted to go through the snapshots, then 
search the reachable manifest entries for all ADDED data files matching a 
partition, and find the latest commit time among them.
   
   We ended up trying to join the snapshots and all_entries metadata tables, 
but due to perf issues and bugs with metadata table aggregation, we started to 
look at the Table and ManifestReader APIs to explore a set of known 
snapshots/manifest files directly, since we roughly knew the latest time frame 
in which the partition was expected to land.  But without the entry context of 
each DataFile, this solution is hard to get working (a DataFile in EXISTING 
mode resulting from a metadata rewrite throws it off).
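
   To make the EXISTING problem concrete, here is a minimal sketch of the 
filtering described above.  Everything in it is a hypothetical stand-in 
(LatencySketch, Entry, Status, latestAddedCommitTime); none of it is the 
Iceberg API.  It assumes each entry is attributed to the snapshot that wrote 
its manifest, so an entry carried over as EXISTING by a rewrite would otherwise 
show up with the later rewrite commit time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Iceberg classes: illustrates why the entry status matters.
final class LatencySketch {
  // Stand-in for a manifest entry status.
  public enum Status { EXISTING, ADDED, DELETED }

  // Simplified entry: its status, the snapshot that wrote the manifest it was
  // read from (an assumption of this sketch), and the data file's partition.
  public static final class Entry {
    final Status status;
    final long snapshotId;
    final String partition;

    public Entry(Status status, long snapshotId, String partition) {
      this.status = status;
      this.snapshotId = snapshotId;
      this.partition = partition;
    }
  }

  // Returns, per partition, the latest commit time among ADDED entries only.
  // EXISTING and DELETED entries are skipped, as are unknown snapshots.
  public static Map<String, Long> latestAddedCommitTime(
      Map<Long, Long> commitTimeBySnapshot, List<Entry> entries) {
    Map<String, Long> latest = new HashMap<>();
    for (Entry entry : entries) {
      if (entry.status != Status.ADDED) {
        continue;
      }
      Long commitTime = commitTimeBySnapshot.get(entry.snapshotId);
      if (commitTime == null) {
        continue;
      }
      latest.merge(entry.partition, commitTime, Long::max);
    }
    return latest;
  }
}
```

   With only a list of DataFile per manifest, the status check in the loop is 
not possible, which is the gap described above.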
   
   Not sure if there was an easier way to do it :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


