rdblue commented on pull request #1783: URL: https://github.com/apache/iceberg/pull/1783#issuecomment-731352161
I agree that we will need to mimic the behavior of `LookupCatalog` and use the catalog manager. I think we can fix some of the problems here by using the current catalog from the catalog manager when there is no catalog in the table name. Here's what I think is the _correct_ thing to do for table identifiers in Spark 3:

1. Parse the identifier into parts.
2. Follow the logic in `LookupCatalog`: if there is only one part, use the current catalog and current namespace. If there are multiple parts, check whether the first one is a catalog and use it if it is. After this, we have a catalog and a table identifier.
3. Check that the catalog is an Iceberg catalog. If not, throw an exception. _Not entirely sure about this one._
4. Return the catalog and identifier through [`SupportsCatalogOptions`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java). That way, we're just using the source to translate back to catalogs.

The problem with this is the behavior when there is no catalog specified in the identifier. The current behavior is to use a `HiveCatalog` that connects to the URI from `hive-site.xml`, like the built-in Spark session catalog. That conflicts with the SQL behavior of using the current catalog, but it may be reasonable to keep for compatibility. But then the problem is that we may not have a registered Iceberg catalog that uses that URI. If not, then there is no catalog for the source to delegate to, and we would need to create one because a catalog is required if the source implements `SupportsCatalogOptions`.

Another option is to change the current behavior slightly, go with the "correct" logic for Spark 3, and delegate to the current catalog. That would mean we always have a catalog to delegate to without creating one. The trade-off is that when the current catalog changes, the table loaded by `IcebergSource` would change, too. I'd be open to this option.

Last, how to handle path URIs: I talked with Anton and Russell about this yesterday, and we think we should make it so that every Spark catalog can load special path identifiers, just like we do today in `IcebergSource`. To do this, we would need a way to pass an identifier that signals this behavior back from `SupportsCatalogOptions`, like `iceberg.hdfs://nn:8020/path/to/table`, then detect those identifiers and return Hadoop tables for them.

What do you think?
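To make steps 1–4 concrete, here is a rough, untested sketch of what the source side could look like, assuming the table name still arrives through the `path` option and that we go with the second option above (delegate to the current catalog when the first part is not a registered catalog). The class name `IcebergSourceSketch` is just for illustration, and path handling is simplified to passing the location through as a single-part identifier rather than an `iceberg.`-prefixed one:

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.SupportsCatalogOptions;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class IcebergSourceSketch implements SupportsCatalogOptions {

  // Steps 2-3: if the first identifier part names a registered catalog, use it;
  // otherwise fall back to the current catalog from the catalog manager.
  @Override
  public String extractCatalog(CaseInsensitiveStringMap options) {
    SparkSession spark = SparkSession.active();
    String[] parts = options.get("path").split("\\.");  // assumes the table name is in "path"
    if (parts.length > 1 &&
        spark.sessionState().catalogManager().isCatalogRegistered(parts[0])) {
      return parts[0];
    }
    return spark.sessionState().catalogManager().currentCatalog().name();
  }

  // Steps 1 and 4: parse the remaining parts into namespace + table name. A
  // path-based table (contains '/') is passed through as a single-part
  // identifier so the catalog can recognize it and return a Hadoop table.
  @Override
  public Identifier extractIdentifier(CaseInsensitiveStringMap options) {
    SparkSession spark = SparkSession.active();
    String path = options.get("path");
    if (path.contains("/")) {
      return Identifier.of(new String[0], path);
    }

    String[] parts = path.split("\\.");
    int start = 0;
    if (parts.length > 1 &&
        spark.sessionState().catalogManager().isCatalogRegistered(parts[0])) {
      start = 1;  // drop the catalog name already resolved in extractCatalog
    }

    String[] remaining = Arrays.copyOfRange(parts, start, parts.length);
    if (remaining.length == 1) {
      // single-part name: use the current namespace, as LookupCatalog does
      String[] ns = spark.sessionState().catalogManager().currentNamespace();
      return Identifier.of(ns, remaining[0]);
    }
    return Identifier.of(
        Arrays.copyOfRange(remaining, 0, remaining.length - 1),
        remaining[remaining.length - 1]);
  }

  // The rest of TableProvider is omitted in this sketch; with
  // SupportsCatalogOptions, Spark loads the table through the resolved catalog.
  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```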

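The catalog-side half for path identifiers could then be as simple as checking for a location-like name in `loadTable` and using `HadoopTables` for it. Again, just a sketch under the same simplified convention; `PathAwareLoading` and `loadPathOrCatalogTable` are hypothetical names, not existing Iceberg APIs:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.spark.sql.connector.catalog.Identifier;

class PathAwareLoading {
  private final HadoopTables hadoopTables = new HadoopTables();  // default Hadoop conf

  // Hypothetical helper that a Spark catalog could call from loadTable(ident):
  // a single-part identifier whose name looks like a location is treated as a
  // path table, everything else goes through the regular Iceberg catalog.
  Table loadPathOrCatalogTable(Identifier ident, Catalog icebergCatalog) {
    String name = ident.name();
    if (ident.namespace().length == 0 && (name.contains("/") || name.contains(":"))) {
      // path identifier, e.g. hdfs://nn:8020/path/to/table: load it as a Hadoop table
      return hadoopTables.load(name);
    }
    return icebergCatalog.loadTable(
        TableIdentifier.of(Namespace.of(ident.namespace()), name));
  }
}
```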