szehon-ho commented on code in PR #4629:
URL: https://github.com/apache/iceberg/pull/4629#discussion_r861445948
##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -176,16 +186,35 @@ protected Dataset<Row> loadMetadataTable(Table table,
MetadataTableType type) {
return SparkTableUtil.loadMetadataTable(spark, table, type);
}
- private static class ReadManifest implements
FlatMapFunction<ManifestFileBean, String> {
- private final Broadcast<FileIO> io;
+ private static class ReadManifest implements
FlatMapFunction<ManifestFileBean, Tuple2<String, String>> {
+ private final Broadcast<Table> table;
- ReadManifest(Broadcast<FileIO> io) {
- this.io = io;
+ ReadManifest(Broadcast<Table> table) {
+ this.table = table;
}
@Override
- public Iterator<String> call(ManifestFileBean manifest) {
- return new ClosingIterator<>(ManifestFiles.readPaths(manifest,
io.getValue()).iterator());
+ public Iterator<Tuple2<String, String>> call(ManifestFileBean manifest) {
+ return new ClosingIterator(entries(manifest));
+ }
+
+ public CloseableIterator<Tuple2<String, String>> entries(ManifestFileBean
manifest) {
+ switch (manifest.content()) {
+ case DATA:
Review Comment:
Good catch, this seems it is working by chance, ManifestFiles.read() only
checks to make sure content=Data, but doesn't use the content anywhere else in
this path. The file returned can still be delete or data file.
Not sure if it's worth to implement manifests.content field correctly (I
guess that's what was done in the original pr: #2518). if there is not much
use now, maybe we can do it in a separately and here just add a new method to
ManifestFiles.files() that returns a Pair of type/path, similar to
ManifestFiles.paths(). @aokolnychyi any preference?
##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/actions/BaseSparkAction.java:
##########
@@ -176,16 +186,35 @@ protected Dataset<Row> loadMetadataTable(Table table,
MetadataTableType type) {
return SparkTableUtil.loadMetadataTable(spark, table, type);
}
- private static class ReadManifest implements
FlatMapFunction<ManifestFileBean, String> {
- private final Broadcast<FileIO> io;
+ private static class ReadManifest implements
FlatMapFunction<ManifestFileBean, Tuple2<String, String>> {
+ private final Broadcast<Table> table;
- ReadManifest(Broadcast<FileIO> io) {
- this.io = io;
+ ReadManifest(Broadcast<Table> table) {
+ this.table = table;
}
@Override
- public Iterator<String> call(ManifestFileBean manifest) {
- return new ClosingIterator<>(ManifestFiles.readPaths(manifest,
io.getValue()).iterator());
+ public Iterator<Tuple2<String, String>> call(ManifestFileBean manifest) {
+ return new ClosingIterator(entries(manifest));
+ }
+
+ public CloseableIterator<Tuple2<String, String>> entries(ManifestFileBean
manifest) {
+ switch (manifest.content()) {
+ case DATA:
Review Comment:
Good catch, this seems it is working by chance, ManifestFiles.read() only
checks to make sure content=Data, but doesn't use the content anywhere else in
this path. The file returned can still be delete or data file.
Not sure if it's worth to implement manifests.content field correctly (I
guess that's what was done in the original pr: #2518). if there is not much
use now, maybe we can here just add a new method to ManifestFiles.files() that
returns a Pair of type/path, similar to ManifestFiles.paths(). @aokolnychyi
any preference?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]