rdblue commented on pull request #1783: URL: https://github.com/apache/iceberg/pull/1783#issuecomment-731352161
I agree that we will need to mimic the behavior of `LookupCatalog` and use the catalog manager. I think we can fix some of the problems here by using the current catalog from the catalog manager when there is no catalog in the table name. Here's what I think is the _correct_ thing to do for table identifiers in Spark 3:

1. Parse the identifier into parts.
2. Follow the logic in `LookupCatalog`: if there is only one part, use the current catalog and current namespace. If there are multiple parts, check whether the first one is a catalog and use it if it is. After this, we have a catalog and a table identifier.
3. Check that the catalog is an Iceberg catalog. If not, throw an exception. _Not entirely sure about this one._
4. Return the catalog and identifier through [`SupportsCatalogOptions`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java). That way, we're just using the source to translate back to catalogs.

The problem with this is the behavior when there is no catalog specified in the identifier. The current behavior is to use a `HiveCatalog` that connects to the URI from `hive-site.xml`, like the built-in Spark session catalog. That conflicts with the SQL behavior of using the current catalog, but it may be reasonable to keep for compatibility. But then the problem is that we may not have a registered Iceberg catalog that uses that URI. If not, then there is no catalog for the source to delegate to, and we would need to create one because a catalog is required if the source implements `SupportsCatalogOptions`.

Another option is to change the current behavior slightly, go with the "correct" logic for Spark 3, and delegate to the current catalog. That would mean we always have a catalog to delegate to without creating one. The trade-off is that when the current catalog changes, the table loaded by `IcebergSource` would change, too. I'd be open to this option.

Last, how to handle path URIs: I talked with Anton and Russell about this yesterday, and we think we should make it so that every Spark catalog can load special path identifiers, just like we do today in `IcebergSource`. To do this, we would need a way to pass an identifier that signals this behavior back from `SupportsCatalogOptions`, like `iceberg.hdfs://nn:8020/path/to/table`, then detect those identifiers and return Hadoop tables for them.

What do you think?
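To make steps 1–4 concrete, here is a rough, untested sketch of what the source side could look like, assuming the table name still arrives through the `path` option and that we go with the second option above (delegate to the current catalog when the first part is not a registered catalog). The class name `IcebergSourceSketch` is just for illustration, and path handling is simplified to passing the location through as a single-part identifier rather than an `iceberg.`-prefixed one:

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.SupportsCatalogOptions;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class IcebergSourceSketch implements SupportsCatalogOptions {

  // Steps 2-3: if the first identifier part names a registered catalog, use it;
  // otherwise fall back to the current catalog from the catalog manager.
  @Override
  public String extractCatalog(CaseInsensitiveStringMap options) {
    SparkSession spark = SparkSession.active();
    String[] parts = options.get("path").split("\\.");  // assumes the table name is in "path"
    if (parts.length > 1 &&
        spark.sessionState().catalogManager().isCatalogRegistered(parts[0])) {
      return parts[0];
    }
    return spark.sessionState().catalogManager().currentCatalog().name();
  }

  // Steps 1 and 4: parse the remaining parts into namespace + table name. A
  // path-based table (contains '/') is passed through as a single-part
  // identifier so the catalog can recognize it and return a Hadoop table.
  @Override
  public Identifier extractIdentifier(CaseInsensitiveStringMap options) {
    SparkSession spark = SparkSession.active();
    String path = options.get("path");
    if (path.contains("/")) {
      return Identifier.of(new String[0], path);
    }

    String[] parts = path.split("\\.");
    int start = 0;
    if (parts.length > 1 &&
        spark.sessionState().catalogManager().isCatalogRegistered(parts[0])) {
      start = 1;  // drop the catalog name already resolved in extractCatalog
    }

    String[] remaining = Arrays.copyOfRange(parts, start, parts.length);
    if (remaining.length == 1) {
      // single-part name: use the current namespace, as LookupCatalog does
      String[] ns = spark.sessionState().catalogManager().currentNamespace();
      return Identifier.of(ns, remaining[0]);
    }
    return Identifier.of(
        Arrays.copyOfRange(remaining, 0, remaining.length - 1),
        remaining[remaining.length - 1]);
  }

  // The rest of TableProvider is omitted in this sketch; with
  // SupportsCatalogOptions, Spark loads the table through the resolved catalog.
  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```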

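The catalog-side half for path identifiers could then be as simple as checking for a location-like name in `loadTable` and using `HadoopTables` for it. Again, just a sketch under the same simplified convention; `PathAwareLoading` and `loadPathOrCatalogTable` are hypothetical names, not existing Iceberg APIs:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.spark.sql.connector.catalog.Identifier;

class PathAwareLoading {
  private final HadoopTables hadoopTables = new HadoopTables();  // default Hadoop conf

  // Hypothetical helper that a Spark catalog could call from loadTable(ident):
  // a single-part identifier whose name looks like a location is treated as a
  // path table, everything else goes through the regular Iceberg catalog.
  Table loadPathOrCatalogTable(Identifier ident, Catalog icebergCatalog) {
    String name = ident.name();
    if (ident.namespace().length == 0 && (name.contains("/") || name.contains(":"))) {
      // path identifier, e.g. hdfs://nn:8020/path/to/table: load it as a Hadoop table
      return hadoopTables.load(name);
    }
    return icebergCatalog.loadTable(
        TableIdentifier.of(Namespace.of(ident.namespace()), name));
  }
}
```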