rdblue commented on pull request #1783:
URL: https://github.com/apache/iceberg/pull/1783#issuecomment-731352161


   I agree that we will need to mimic the behavior of `LookupCatalog` and use 
the catalog manager. I think we can fix some of the problems with this by using 
the current catalog from the catalog manager if there is no catalog identifier 
in the table name.
   
   Here's what I think is the _correct_ thing to do for table identifiers in 
Spark 3:
   1. Parse the identifier into parts
   2. Follow the logic in `LookupCatalog`: if there is only one part, use the 
current catalog and current namespace. If there are multiple parts, check 
whether the first one is a catalog and use it if it is. After this, we have a 
catalog and a table identifier.
   3. Check that the catalog is an Iceberg catalog. If not, throw an exception. 
_Not entirely sure about this one._
   4. Return the catalog and identifier through 
[`SupportsCatalogOptions`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java).
   
   That way, we're just using the source to translate back to catalogs.
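   
   For concreteness, here is a rough sketch of steps 1 and 2. This is not the 
actual implementation: the class and method names are made up, the parsing is 
simplified (it ignores quoting), and the set of registered catalog names 
stands in for a lookup through Spark's catalog manager.
   
```java
import java.util.Arrays;
import java.util.Set;
import org.apache.spark.sql.connector.catalog.Identifier;

class IdentifierResolution {
  static class Resolved {
    final String catalogName;     // null means "use the current catalog"
    final Identifier identifier;

    Resolved(String catalogName, Identifier identifier) {
      this.catalogName = catalogName;
      this.identifier = identifier;
    }
  }

  // registeredCatalogs stands in for asking the catalog manager whether a
  // catalog with this name is configured
  static Resolved resolve(String table, Set<String> registeredCatalogs,
                          String[] currentNamespace) {
    String[] parts = table.split("\\.");  // step 1: parse into parts

    if (parts.length == 1) {
      // step 2: a single part resolves in the current catalog and namespace
      return new Resolved(null, Identifier.of(currentNamespace, parts[0]));
    }

    if (registeredCatalogs.contains(parts[0])) {
      // step 2: the first part names a catalog, the rest is the identifier
      String[] namespace = Arrays.copyOfRange(parts, 1, parts.length - 1);
      return new Resolved(parts[0], Identifier.of(namespace, parts[parts.length - 1]));
    }

    // otherwise all parts are the identifier, resolved in the current catalog
    String[] namespace = Arrays.copyOfRange(parts, 0, parts.length - 1);
    return new Resolved(null, Identifier.of(namespace, parts[parts.length - 1]));
  }
}
```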
   
   The problem with this is the behavior when no catalog is specified in the 
identifier. The current behavior is to use a `HiveCatalog` that connects to 
the URI from `hive-site.xml`, like the built-in Spark session catalog. That 
conflicts with the SQL behavior of using the current catalog, but it may be 
reasonable to keep for compatibility. The problem then is that we may not have 
a registered Iceberg catalog that uses that URI. If we don't, then there is no 
catalog for the source to delegate to, and we would need to create one because 
a catalog is required if the source implements `SupportsCatalogOptions`.
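   
   To make that concrete: delegating only works if an Iceberg catalog that 
points at the same metastore URI is already registered, something like the 
configuration below. The catalog name and URI here are made up for 
illustration; if nothing like this is configured, the source would have to 
create an equivalent catalog on the fly.
   
```java
// Hypothetical example: the catalog name "default_iceberg" and the URI are
// made up; the URI would have to match what hive-site.xml points to
SparkSession spark = SparkSession.builder()
    .config("spark.sql.catalog.default_iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.default_iceberg.type", "hive")
    .config("spark.sql.catalog.default_iceberg.uri", "thrift://metastore-host:9083")
    .getOrCreate();
```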
   
   Another option is to change the current behavior slightly: go with the 
"correct" logic for Spark 3 and delegate to the current catalog. That would 
mean we always have a catalog to delegate to without creating one. The 
trade-off is that when the current catalog changes, the table loaded by 
`IcebergSource` would change, too. I'd be open to this option.
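   
   For example, with that behavior the same unqualified name could resolve to 
different tables depending on session state. This is a hypothetical 
illustration that assumes two configured catalogs named `prod` and `test`:
   
```java
// assumes an existing SparkSession named spark and catalogs "prod" and "test"
spark.sql("USE prod.db");
Dataset<Row> fromProd = spark.read().format("iceberg").load("db.table");  // resolves in prod

spark.sql("USE test.db");
Dataset<Row> fromTest = spark.read().format("iceberg").load("db.table");  // now resolves in test
```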
   
   Last, how to handle path URIs: I talked with Anton and Russell about this 
yesterday, and we think that every Spark catalog should be able to load 
special path identifiers, just like `IcebergSource` does today. To do this, we 
would need a way to pass an identifier that signals this behavior back from 
`SupportsCatalogOptions`, like `iceberg.hdfs://nn:8020/path/to/table`, and 
then detect those identifiers and return Hadoop tables for them.
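   
   A rough sketch of the catalog-side detection, with made-up helper names 
(handling of the `iceberg.` prefix is omitted): detect an identifier whose 
name looks like a location and load it directly as a Hadoop table.
   
```java
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.spark.sql.connector.catalog.Identifier;

class PathIdentifierSupport {
  private static final HadoopTables HADOOP_TABLES = new HadoopTables();

  // true for identifiers like hdfs://nn:8020/path/to/table or file:/tmp/table;
  // a real check would probably be stricter about the URI scheme
  static boolean isPathIdentifier(Identifier ident) {
    String name = ident.name();
    return name.contains(":") && name.contains("/");
  }

  static Table loadPathTable(Identifier ident) {
    // the identifier name carries the table location, so load it directly
    return HADOOP_TABLES.load(ident.name());
  }
}
```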
   
   What do you think?

