rdblue commented on pull request #1843:
URL: https://github.com/apache/iceberg/pull/1843#issuecomment-736749862
I agree with the approach to use `HadoopTables` to return tables when a path
identifier is found. What I'm not sure about is how to pass a path as an
identifier. Supporting a namespace element that has the file format is useful
for imports, but not for the `IcebergSource` because I don't think that we want
the source to support those identifiers.
For table imports, I think we have more options because we're controlling
the parsed statement or the stored procedure. That procedure could use optional
arguments like `file_format` or `location`. Similarly, a parsed statement can
be transformed with rules to pass the reference as something other than an
`Identifier`. What is relevant here is that we have options for those cases, so
I think we should focus in this PR on how to pass paths from `IcebergSource` to
a catalog, and loading in that catalog.
Looking at the Spark, the identifier is immediately used to load the table,
or is added to a CTAS plan. I don't think that the identifier is modified after
it is returned. We could use a custom `Identifier` implementation,
`PathIdentifier`, to signal that a path should be loaded.
```java
public class PathIdentifier implements Identifier {
private final String location;
private final String name;
public PathIdentifier(String location) {
this.location = location;
this.name =
Iterables.getLastElement(Splitter.on("/").splitToList(location));
}
public String location() {
return location;
}
public String namespace() {
return new String[] { location };
}
public String name() {
return name;
}
}
```
This uses the last part of the location string as the table name, so that
the default subquery aliases added in Spark work.
Then each method in `SparkCatalog` would just need to check `instanceof
PathIdentifier`, which is a bit cleaner than the `/` check -- although that
would still need to be done in `IcebergSource`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]