aokolnychyi opened a new issue #4346: URL: https://github.com/apache/iceberg/issues/4346
There have been multiple attempts to make our `DeleteOrphanFiles` action more reliable. One such [discussion](https://the-asf.slack.com/archives/CF01LKV9S/p1601663482098100) happened more than a year ago. However, we never reached consensus. I will try to summarize my current thoughts, but I encourage everyone to comment as well.

### Location Generation

There are three location types in Iceberg.

#### Table location

Table locations are either provided by the user or defaulted in `TableOperations`. When defaulting, we currently manipulate raw strings via methods such as `String.format`. That means there is no normalization/validation for root table locations.

#### Metadata

Classes that extend `BaseMetastoreTableOperations` use `metadataFileLocation` to generate a new location for all types of metadata files. Under the hood, it simply uses `String.format` and has no location normalization.

```
private String metadataFileLocation(TableMetadata metadata, String filename) {
  String metadataLocation = metadata.properties().get(TableProperties.WRITE_METADATA_LOCATION);

  if (metadataLocation != null) {
    return String.format("%s/%s", metadataLocation, filename);
  } else {
    return String.format("%s/%s/%s", metadata.location(), METADATA_FOLDER_NAME, filename);
  }
}
```

In `HadoopTableOperations`, we rely on `Path` instead of `String.format` as we have access to Hadoop classes.

```
private Path metadataPath(String filename) {
  return new Path(metadataRoot(), filename);
}

private Path metadataRoot() {
  return new Path(location, "metadata");
}
```

That means some normalization is happening for metadata file locations generated in `HadoopTableOperations`.

#### Data

Data file locations depend on the `LocationProvider` returned by `TableOperations`. While users can inject a custom location provider, Iceberg has two built-in implementations:

- `DefaultLocationProvider`
- `ObjectStoreLocationProvider`

Both built-in implementations use `String.format` and have no normalization/validation.

### Problem

Right now, `DeleteOrphanFiles` uses the Hadoop `FileSystem` API to list all actual files in a given location and compares them to the locations stored in the metadata. As discussed above, Iceberg does not do any normalization for locations persisted in the metadata. That means locations returned during listing may have cosmetic differences compared to locations stored in the metadata, even though both can point to the same files. As a consequence, `DeleteOrphanFiles` can corrupt a table.
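To make the failure mode concrete, here is a minimal illustration (bucket, table, and file names are made up, and the double slash is assumed to come from a root location that was configured with a trailing slash). A plain string comparison treats the two locations as different files, so the reachable file would be flagged as an orphan:

```
import org.apache.hadoop.fs.Path;

public class LocationMismatchExample {
  public static void main(String[] args) {
    // location as persisted in metadata (built via String.format, never normalized)
    String inMetadata = "s3a://bucket/warehouse/db/tbl//data/00000-0-data.parquet";
    // the same file as returned by FileSystem listing (FileStatus paths are normalized)
    String fromListing = "s3a://bucket/warehouse/db/tbl/data/00000-0-data.parquet";

    System.out.println(inMetadata.equals(fromListing));                     // false -> treated as orphan
    System.out.println(new Path(inMetadata).equals(new Path(fromListing))); // true after Path normalization
  }
}
```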
### Proposed Approach

- We cannot change what is already stored in the metadata, so `DeleteOrphanFiles` should normalize the locations of reachable files. Since we do the listing via a Hadoop `FileSystem`, we should probably leverage Hadoop classes for normalization to avoid surprises. For example, just constructing a new `Path` from a `String` normalizes the path part of the URI.
  ```
  Path path = new Path("hdfs://localhost:8020/user//log/data///dummy_file/");
  path.toString(); // hdfs://localhost:8020/user/log/data/dummy_file
  ```
- Normalization is required but does not solve all issues. Since table locations are arbitrary, we may hit a few weird cases:
  - Data or metadata locations without a scheme and authority.
  - Changes in the Hadoop conf. We may have one set of configured file systems when the table was created and a completely different one when deleting orphans. For example, the scheme name can change (it is just a string), the authority can be represented via an IP address instead of a host name, or multiple host names can be mapped to the same name node.
  - I am not sure whether it is possible, but can someone migrate from `s3a` to `s3` or vice versa?
- The action should expose options to ignore the scheme and authority during the comparison. If that happens, only normalized paths will be compared.
- The location we are about to clean must be validated. If the action is configured to take the scheme and authority into account, the provided location must have those set. In other words, it is illegal to provide a location without an authority if the action is supposed to compare authorities.
- Locations persisted in the metadata without a scheme and authority must inherit those values from the location we scan for orphans, not from the current Hadoop conf. This essentially means we will only compare the normalized path for such locations.

When it comes to possible implementations, we can call `mapPartitions` on the `DataFrame` with locations (see the sketch at the end of this description for a fuller picture).

```
Path path = new Path(location); // constructing a Path normalizes the path part
URI uri = path.toUri(); // gives us access to the scheme, authority and path
...
// run validation, inherit the scheme and authority or ignore them if not needed
Path newPath = new Path(newScheme, newAuthority, uri.getPath());
return newPath.toString();
```

I know @karuppayya has been working on a fix, so I wanted to make sure we build consensus first.

cc @karuppayya @RussellSpitzer @rdblue @flyrain @szehon-ho @jackye1995 @pvary @openinx @rymurr
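For illustration, here is a rough, self-contained sketch of the kind of per-location normalization the action could apply before joining reachable and listed files. The class and option names are hypothetical, not part of any existing Iceberg API; it only shows how normalization, scheme/authority inheritance, and validation could fit together under the assumptions above.

```
import java.io.Serializable;
import java.net.URI;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not Iceberg API: turns a location into a canonical string
// that is only used for equality checks between metadata and listed locations.
class OrphanFileLocationNormalizer implements Serializable {
  private final String defaultScheme;     // inherited from the location we scan for orphans
  private final String defaultAuthority;  // inherited from the location we scan for orphans
  private final boolean ignoreScheme;
  private final boolean ignoreAuthority;

  OrphanFileLocationNormalizer(String scanLocation, boolean ignoreScheme, boolean ignoreAuthority) {
    URI scanUri = new Path(scanLocation).toUri();
    this.defaultScheme = scanUri.getScheme();
    this.defaultAuthority = scanUri.getAuthority();
    this.ignoreScheme = ignoreScheme;
    this.ignoreAuthority = ignoreAuthority;

    // validation: if schemes/authorities are part of the comparison,
    // the location we are about to clean must actually have them
    if (!ignoreScheme && defaultScheme == null) {
      throw new IllegalArgumentException("Scan location must have a scheme: " + scanLocation);
    }
    if (!ignoreAuthority && defaultAuthority == null) {
      throw new IllegalArgumentException("Scan location must have an authority: " + scanLocation);
    }
  }

  String normalize(String location) {
    URI uri = new Path(location).toUri(); // constructing a Path normalizes the path part

    // inherit the scheme/authority from the scan location (not the current Hadoop conf)
    // when they are missing, or drop them entirely when the comparison ignores them
    String scheme = ignoreScheme ? "" : orElse(uri.getScheme(), defaultScheme);
    String authority = ignoreAuthority ? "" : orElse(uri.getAuthority(), defaultAuthority);

    return scheme + "://" + authority + uri.getPath();
  }

  private static String orElse(String value, String fallback) {
    return value != null ? value : fallback;
  }
}
```

With something like this, both the reachable-file `DataFrame` and the listed-file `DataFrame` could be mapped through `normalize` inside `mapPartitions` before the anti-join that decides which files are orphans.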
