[GitHub] [iceberg] rdblue commented on pull request #1843: Support for file paths in SparkCatalogs via HadoopTables

GitBox Tue, 01 Dec 2020 10:54:10 -0800


rdblue commented on pull request #1843:
URL: https://github.com/apache/iceberg/pull/1843#issuecomment-736749862



   I agree with the approach to use `HadoopTables` to return tables when a path 
identifier is found. What I'm not sure about is how to pass a path as an 
identifier. Supporting a namespace element that has the file format is useful 
for imports, but not for the `IcebergSource` because I don't think that we want 
the source to support those identifiers.
   
   For table imports, I think we have more options because we're controlling 
the parsed statement or the stored procedure. That procedure could use optional 
arguments like `file_format` or `location`. Similarly, a parsed statement can 
be transformed with rules to pass the reference as something other than an 
`Identifier`. What is relevant here is that we have options for those cases, so 
I think this PR should focus on how to pass paths from `IcebergSource` to a 
catalog, and on how to load them in that catalog.
   
   Looking at Spark, the identifier is immediately used to load the table, 
or is added to a CTAS plan. I don't think that the identifier is modified after 
it is returned. We could use a custom `Identifier` implementation, 
`PathIdentifier`, to signal that a path should be loaded.
   
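   ```java
   import com.google.common.base.Splitter;
   import com.google.common.collect.Iterables;
   import org.apache.spark.sql.connector.catalog.Identifier;
   
   public class PathIdentifier implements Identifier {
     private final String location;
     private final String name;
   
     public PathIdentifier(String location) {
       this.location = location;
       // use the last path element as the table name
       this.name = Iterables.getLast(Splitter.on("/").splitToList(location));
     }
   
     public String location() {
       return location;
     }
   
     @Override
     public String[] namespace() {
       // the full location is the only namespace element
       return new String[] { location };
     }
   
     @Override
     public String name() {
       return name;
     }
   }
   ```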
   
   This uses the last part of the location string as the table name, so that 
the default subquery aliases added in Spark work.
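
To make the naming behavior concrete, here is a minimal JDK-only sketch of the "last part of the location" rule. The class `PathNames`, the method `lastSegment`, and the sample locations are hypothetical names for illustration, not part of the PR.

```java
// Hypothetical helper, not part of the PR: derives a table name from a location.
final class PathNames {
  private PathNames() {
  }

  // Returns the text after the final '/', or the whole string if there is no '/'.
  static String lastSegment(String location) {
    return location.substring(location.lastIndexOf('/') + 1);
  }
}
```

For example, `PathNames.lastSegment("hdfs://nn:8020/warehouse/db/events")` returns `events`. A location that ends in `/` yields an empty name, and the same is true of a `Splitter.on("/")` version, since `Splitter` keeps trailing empty strings by default. Trimming trailing slashes before deriving the name may therefore be worth considering.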
   
   Then each method in `SparkCatalog` would just need to check `instanceof 
PathIdentifier`, which is a bit cleaner than the `/` check -- although that 
would still need to be done in `IcebergSource`.
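
A rough sketch of how the two checks could fit together, assuming stand-in types so that it compiles without Spark or Iceberg on the classpath. The `Identifier` interface here only mimics the shape of Spark's interface, and `NamedIdentifier`, `CatalogDispatch`, `toIdentifier`, and `describeLoad` are hypothetical names. The returned strings stand in for real table loading.

```java
// Stand-in for Spark's Identifier interface so the sketch compiles on its own.
interface Identifier {
  String[] namespace();

  String name();
}

// Simplified version of the PathIdentifier proposed above.
final class PathIdentifier implements Identifier {
  private final String location;

  PathIdentifier(String location) {
    this.location = location;
  }

  String location() {
    return location;
  }

  @Override
  public String[] namespace() {
    return new String[] {location};
  }

  @Override
  public String name() {
    return location.substring(location.lastIndexOf('/') + 1);
  }
}

// Hypothetical ordinary identifier made of a namespace and a table name.
final class NamedIdentifier implements Identifier {
  private final String[] namespace;
  private final String name;

  NamedIdentifier(String[] namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  @Override
  public String[] namespace() {
    return namespace;
  }

  @Override
  public String name() {
    return name;
  }
}

final class CatalogDispatch {
  private CatalogDispatch() {
  }

  // Source side: the '/' check decides whether the reference is a path.
  static Identifier toIdentifier(String ref) {
    if (ref.contains("/")) {
      return new PathIdentifier(ref);
    }
    int dot = ref.lastIndexOf('.');
    String[] ns = dot < 0 ? new String[0] : ref.substring(0, dot).split("\\.");
    return new NamedIdentifier(ns, ref.substring(dot + 1));
  }

  // Catalog side: only the instanceof check is needed to pick the load path.
  static String describeLoad(Identifier ident) {
    if (ident instanceof PathIdentifier) {
      // a real catalog would load the table by location here
      return "path:" + ((PathIdentifier) ident).location();
    }
    // a real catalog would do its normal lookup by namespace and name here
    return "catalog:" + ident.name();
  }
}
```

With this split, the string inspection happens once in the source, and every catalog method branches on the identifier's type rather than re-parsing it.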




