szehon-ho commented on a change in pull request #3365:
URL: https://github.com/apache/iceberg/pull/3365#discussion_r736041815
##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -267,17 +268,28 @@ public static StructLikeSet actualRowSet(Table table, Long snapshotId, String...
public static Map<Long, List<DataFile>> snapshotToDataFiles(Table table) throws IOException {
table.refresh();
+
Map<Long, List<DataFile>> result = Maps.newHashMap();
- List<ManifestFile> manifestFiles = table.currentSnapshot().dataManifests();
- for (ManifestFile manifestFile : manifestFiles) {
- try (ManifestReader<DataFile> reader = ManifestFiles.read(manifestFile, table.io())) {
- List<DataFile> dataFiles = Lists.newArrayList(reader);
- if (result.containsKey(manifestFile.snapshotId())) {
- result.get(manifestFile.snapshotId()).addAll(dataFiles);
- } else {
- result.put(manifestFile.snapshotId(), dataFiles);
- }
+ Snapshot current = table.currentSnapshot();
Review comment:
I get what you mean. If you are curious, a real-life use case we had for
ManifestEntry was a bit similar to this.
We were trying to build a data latency monitoring application that measures
the max data latency per partition. So we wanted to go through the snapshots,
search the reachable manifest entries for all ADDED data files matching a
partition, and find the latest commit time among them.
We ended up trying to join the snapshot + all_entries metadata tables, but due
to perf issues and bugs with metadata table aggregation, we started to look at
the Table and ManifestReader APIs to explore a set of known snapshots/manifest
files directly, as we roughly knew the time frame in which the partition was
expected to land at the latest. But without knowing the context of a DataFile
(whether its entry is ADDED or EXISTING), this solution is hard to get working
(a DataFile in EXISTING mode, resulting from a metadata rewrite, throws it
off).
Not sure if there was an easier way to do it :)
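
To make the latency computation above concrete, here is a rough sketch of the
logic on a simplified in-memory model. Everything in it (PartitionLatency,
Status, Entry, SnapshotInfo, latestCommitPerPartition) is a hypothetical
stand-in for illustration, not the Iceberg API; it assumes each entry carries
its status and the id of the snapshot that added the file, which is exactly
the context a bare DataFile does not give you.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionLatency {
  // Hypothetical stand-in for a manifest entry status; not the Iceberg class.
  enum Status { ADDED, EXISTING, DELETED }

  // Hypothetical stand-in for a manifest entry: a data file plus its context.
  static final class Entry {
    final Status status;
    final long snapshotId;   // snapshot that originally added the file
    final String partition;
    final String path;

    Entry(Status status, long snapshotId, String partition, String path) {
      this.status = status;
      this.snapshotId = snapshotId;
      this.partition = partition;
      this.path = path;
    }
  }

  // Hypothetical stand-in for a snapshot and the entries reachable from it.
  static final class SnapshotInfo {
    final long snapshotId;
    final long commitMillis;
    final List<Entry> entries;

    SnapshotInfo(long snapshotId, long commitMillis, List<Entry> entries) {
      this.snapshotId = snapshotId;
      this.commitMillis = commitMillis;
      this.entries = entries;
    }
  }

  // For each partition, return the latest commit time of a snapshot that
  // ADDED a file to it. EXISTING entries (for example, ones carried over by
  // a metadata rewrite) and entries added by other snapshots are skipped, so
  // a rewrite does not look like fresh data.
  static Map<String, Long> latestCommitPerPartition(List<SnapshotInfo> snapshots) {
    Map<String, Long> latest = new HashMap<>();
    for (SnapshotInfo snapshot : snapshots) {
      for (Entry entry : snapshot.entries) {
        boolean addedHere =
            entry.status == Status.ADDED && entry.snapshotId == snapshot.snapshotId;
        if (addedHere) {
          latest.merge(entry.partition, snapshot.commitMillis, Long::max);
        }
      }
    }
    return latest;
  }

  public static void main(String[] args) {
    SnapshotInfo append = new SnapshotInfo(1L, 1000L, Arrays.asList(
        new Entry(Status.ADDED, 1L, "dt=2021-10-25", "a.parquet")));
    // A later metadata rewrite re-lists the same file as EXISTING.
    SnapshotInfo rewrite = new SnapshotInfo(2L, 5000L, Arrays.asList(
        new Entry(Status.EXISTING, 1L, "dt=2021-10-25", "a.parquet")));
    System.out.println(latestCommitPerPartition(Arrays.asList(append, rewrite)));
  }
}
```

In this model the rewrite snapshot contributes nothing, so the partition's
latest commit time stays at the original append. With only a list of DataFile
per manifest, the two cases cannot be told apart.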
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]