aokolnychyi opened a new issue #4346:
URL: https://github.com/apache/iceberg/issues/4346


   There have been multiple attempts to make our `DeleteOrphanFiles` action 
more reliable. One such 
[discussion](https://the-asf.slack.com/archives/CF01LKV9S/p1601663482098100) 
happened more than a year ago. However, we never reached consensus.
   
   I will try to summarize my current thoughts but I encourage everyone to 
comment as well.
   
   ### Location Generation
   
   There are three location types in Iceberg.
   
   #### Table location
   
   Table locations are either provided by the user or defaulted in `TableOperations`. When defaulting, we currently manipulate raw strings via methods such as `String.format`. That means there is no normalization/validation for root table locations.
   
   #### Metadata
   
   Classes that extend `BaseMetastoreTableOperations` use `metadataFileLocation` to generate a new location for all types of metadata files. Under the hood, it simply uses `String.format` and has no location normalization.
   
   ```
   private String metadataFileLocation(TableMetadata metadata, String filename) 
{
     String metadataLocation = 
metadata.properties().get(TableProperties.WRITE_METADATA_LOCATION);
   
     if (metadataLocation != null) {
       return String.format("%s/%s", metadataLocation, filename);
     } else {
       return String.format("%s/%s/%s", metadata.location(), 
METADATA_FOLDER_NAME, filename);
     }
   }
   ```
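
   As a purely illustrative example (the locations below are made up), any cosmetic quirk in the configured location, such as a trailing slash, survives this `String.format` concatenation and gets persisted as-is:

   ```
   String metadataLocation = "s3://bucket/warehouse/db/tbl/metadata/"; // note the trailing slash
   String filename = "00001-uuid.metadata.json";
   String result = String.format("%s/%s", metadataLocation, filename);
   // -> s3://bucket/warehouse/db/tbl/metadata//00001-uuid.metadata.json (double slash persisted)
   ```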
   
   In `HadoopTableOperations`, we rely on `Path` instead of `String.format` as we have access to Hadoop classes.
   
   ```
   private Path metadataPath(String filename) {
     return new Path(metadataRoot(), filename);
   }
   
   private Path metadataRoot() {
     return new Path(location, "metadata");
   }
   ```
   
   That means some normalization is happening for metadata file locations 
generated in `HadoopTableOperations`.
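
   For instance (paths made up), the `Path(parent, child)` constructor used above normalizes the path part, collapsing duplicate slashes and dropping trailing slashes:

   ```
   Path root = new Path("hdfs://nn:8020/warehouse/db/tbl//", "metadata");
   Path metadataPath = new Path(root, "v2.metadata.json");
   metadataPath.toString(); // hdfs://nn:8020/warehouse/db/tbl/metadata/v2.metadata.json
   ```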
   
   #### Data
   
   Data file locations depend on the `LocationProvider` returned by `TableOperations`. While users can inject a custom location provider, Iceberg has two built-in implementations:
   
   - `DefaultLocationProvider`
   - `ObjectStoreLocationProvider`
   
   Both built-in implementations use `String.format` and have no normalization/validation.
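
   For example (values made up), the same `String.format` pattern means the shape of the table location, such as a missing scheme or a trailing slash, propagates verbatim into every persisted data file path:

   ```
   // illustrative only; the data path inherits whatever shape the table location has
   String dataFile = String.format("%s/data/%s", "/warehouse/db/tbl", "00000-0-uuid.parquet");
   // -> /warehouse/db/tbl/data/00000-0-uuid.parquet (no scheme or authority persisted)
   ```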
   
   ### Problem
   
   Right now, `DeleteOrphanFiles` uses Hadoop `FileSystem` to list all actual files in a given location and compares them to the locations stored in the metadata. As discussed above, Iceberg does not do any normalization for locations persisted in the metadata. That means locations returned during listing may have cosmetic differences compared to locations stored in the metadata, even though both point to the same files. As a consequence, `DeleteOrphanFiles` can corrupt a table by deleting files that are still reachable.
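
   To make the failure mode concrete (locations below are made up), a purely textual comparison treats two spellings of the same file as different, so the listed copy looks orphan:

   ```
   // both strings refer to the same physical file
   String fromMetadata = "hdfs://nn:8020/warehouse/db/tbl//data/00000-0-file.parquet";
   String fromListing  = "hdfs://nn:8020/warehouse/db/tbl/data/00000-0-file.parquet";
   fromMetadata.equals(fromListing); // false -> the file is considered orphan and gets deleted
   ```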
   
   ### Proposed Approach
   
   - We cannot change what is already stored in the metadata, so `DeleteOrphanFiles` should normalize the locations of reachable files. Since we do the listing via Hadoop `FileSystem`, we should probably leverage Hadoop classes for normalization to avoid surprises. For example, just constructing a new `Path` from a `String` normalizes the path part of the URI.
   
   ```
   Path path = new Path("hdfs://localhost:8020/user//log/data///dummy_file/");
   path.toString(); // hdfs://localhost:8020/user/log/data/dummy_file
   ```
   
   - Normalization is required but does not solve all issues. Since table 
locations are arbitrary, we may hit a few weird cases.
       - Data or metadata locations without a scheme and authority.
       - Changes in the Hadoop conf. We may have one set of configured file systems when the table was created and a completely different one when deleting orphans. For example, the scheme name can change (it is just a string), the authority can be represented via an IP address instead of a host name, or multiple host names can be mapped to the same name node.
       - I am not sure whether it is possible, but can someone migrate from `s3a` to `s3` or vice versa?
   
   - The action should expose options to ignore the scheme and authority during 
the comparison. If that happens, only normalized paths will be compared.
   - The location we are about to clean must be validated. If the action is 
configured to take scheme and authority into account, the provided location 
must have those set. In other words, it is illegal to provide a location 
without an authority if the action is supposed to compare authorities.
   - Locations persisted in the metadata without a scheme and authority must 
inherit those values from the location we scan for orphans, not from the 
current Hadoop conf. This essentially means we will only compare the normalized 
path for such locations.
   
   When it comes to possible implementations, we can call `mapPartitions` on the `DataFrame` of locations.
   
   ```
   Path path = new Path(location); // should normalize the path
   URI uri = path.toUri(); // should give us access to scheme, authority, path
   ... // run validation, inherit scheme and authority or ignore them if not 
needed
   Path newPath = new Path(newScheme, newAuthority, uri.getPath());
   return newPath.toString();
   ```
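
   To spell that out a bit more, here is a hedged, self-contained sketch of the per-location function such a `mapPartitions` call could apply. The helper name, parameters, and error handling are illustrative, not an existing Iceberg API:

   ```
   import java.net.URI;
   import org.apache.hadoop.fs.Path;

   // illustrative helper; validation of the user-provided scan location is elided
   static String normalizeLocation(String location, String fallbackScheme, String fallbackAuthority,
                                   boolean compareSchemeAndAuthority) {
     Path path = new Path(location); // normalizes the path part (duplicate/trailing slashes)
     URI uri = path.toUri();         // gives access to scheme, authority, path

     if (!compareSchemeAndAuthority) {
       // the action was asked to ignore scheme and authority, so compare normalized paths only
       return uri.getPath();
     }

     // inherit a missing scheme/authority from the location being scanned, not from the Hadoop conf
     String scheme = uri.getScheme() != null ? uri.getScheme() : fallbackScheme;
     String authority = uri.getAuthority() != null ? uri.getAuthority() : fallbackAuthority;
     return new Path(scheme, authority, uri.getPath()).toString();
   }
   ```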
   
   I know @karuppayya has been working on a fix so I wanted to make sure we 
build consensus first.
   
   cc @karuppayya @RussellSpitzer @rdblue @flyrain @szehon-ho @jackye1995 
@pvary @openinx @rymurr

