ericlgoodman commented on PR #5331:
URL: https://github.com/apache/iceberg/pull/5331#issuecomment-1201826444
Adding here my primary concern with this PR - and in general a concern going
forward with using multiple table formats such as Delta Lake and Iceberg in Spark.
Spark reads tables through whatever catalog is located at the first part of
a table's identifier. There can only be 1 catalog per identifier, and different
catalogs have different capabilities. For example, the `DeltaCatalog` can read
Delta Lake and generic Hive tables, and the `SparkSessionCatalog` can read
Iceberg + Hive tables.
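For concreteness, here is a minimal sketch of that resolution rule. The catalog name (`my_catalog`), implementation class, database, and table below are placeholders chosen for illustration, not anything defined in this PR; the point is only that the name after `spark.sql.catalog.` is what appears as the first part of a table identifier.
```java
import org.apache.spark.sql.SparkSession;

public class CatalogResolutionSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("catalog-resolution").getOrCreate();

    // Registering a catalog plugin under the name "my_catalog"...
    spark.conf().set("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.my_catalog.type", "hive");

    // ...makes it responsible for any identifier whose first part is "my_catalog";
    // the remaining parts (database/namespace and table) are resolved by that plugin.
    spark.sql("SELECT * FROM my_catalog.db.events").show();
  }
}
```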
In theory, in order to read from multiple table types in one Spark session,
a user would initialize a `DeltaCatalog` at, say, `delta` and then the
`SparkSessionCatalog` at `iceberg`. Then all their Delta Lake tables would be
located at `delta.my_delta_database.my_delta_lake_table` and all their Iceberg
tables at `iceberg.my_iceberg_database.my_iceberg_table`. Unfortunately, this
doesn't work out of the box. Both of these catalog implementations are designed
to be used by overriding the default Spark catalog, which is located at
`spark_catalog`. `CatalogExtension`, from which `DeltaCatalog` and
`SparkSessionCatalog` both inherit, contains a method
`setDelegateCatalog(CatalogPlugin delegate)`. As the Javadoc reads:
```java
/**
 * This will be called only once by Spark to pass in the Spark built-in session catalog, after
 * {@link #initialize(String, CaseInsensitiveStringMap)} is called.
*/
void setDelegateCatalog(CatalogPlugin delegate);
```
A user can work around this by manually calling this method during Spark setup
and passing Spark's built-in session catalog in as the delegate. But most
users presumably are not doing this, and some users might face difficulty
depending on their service provider and how much abstraction/configuration has
been taken away from them during setup.
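As a concrete illustration of that manual wiring, here is a minimal sketch. It relies on `sessionState()` and `catalogManager()`, which are Spark internals rather than stable public API, and the catalog names, implementation classes, and table identifiers are examples; whether this works end to end also depends on the specific catalog implementations.
```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.connector.catalog.CatalogExtension;
import org.apache.spark.sql.connector.catalog.CatalogPlugin;

public class MultiFormatCatalogSetup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("delta-and-iceberg")
        // Register the format-specific catalogs under non-default names,
        // leaving spark_catalog itself untouched.
        .config("spark.sql.catalog.delta", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.catalog.iceberg.type", "hive")
        .getOrCreate();

    // Because neither catalog is registered as spark_catalog, Spark never calls
    // setDelegateCatalog() on them, so the delegate has to be wired in by hand
    // (the same would be needed for the delta catalog if it relies on a delegate).
    CatalogPlugin builtIn = spark.sessionState().catalogManager().catalog("spark_catalog");
    CatalogPlugin iceberg = spark.sessionState().catalogManager().catalog("iceberg");
    ((CatalogExtension) iceberg).setDelegateCatalog(builtIn);

    // Tables of both formats are now addressable from one session:
    spark.sql("SELECT * FROM delta.my_delta_database.my_delta_lake_table").show();
    spark.sql("SELECT * FROM iceberg.my_iceberg_database.my_iceberg_table").show();
  }
}
```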
This basically means that in today's world, users don't have a simple way to use
one Spark session to read from or migrate between different table formats.
Solving that might make sense to tackle first, as users may find that a
Delta/Iceberg/Hudi table makes sense for them in one context while a different
format is preferable in another.
When it comes to migration, there are basically two options:
1. Create a more abstract catalog implementation that can read
Iceberg/Delta/Hudi/Hive tables dynamically, similar to what happens in the
Trino Hive connector: the connector inspects the table properties and
determines at runtime whether to redirect to another connector. Similarly, a
Spark catalog could simply delegate to format-specific catalogs when it sees
certain format-specific table properties (a rough sketch of this idea follows
the list).
2. Provide an easier way for users to avoid having to override the default
catalog for these format-specific catalog implementations. If the Delta catalog
were located at `delta` and the Iceberg catalog at `iceberg`, then users could
simply keep their different table formats in different catalogs, and migration
could take an optional parameter for the desired target catalog.
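For option 1, here is a rough, non-authoritative sketch of what such a dispatching catalog could look like. It is not the actual Iceberg, Delta, or Trino implementation: the property names used as format markers (`table_type` = `ICEBERG` and `provider` = `delta`) and the way the format-specific catalogs are obtained are assumptions made purely for illustration.
```java
import java.util.Map;
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCatalog;

public class FormatDispatchingCatalog extends DelegatingCatalogExtension {

  // Format-specific catalogs; how these are configured and instantiated is
  // omitted from this sketch.
  private TableCatalog icebergCatalog;
  private TableCatalog deltaCatalog;

  @Override
  public Table loadTable(Identifier ident) throws NoSuchTableException {
    // Load the table metadata through the delegate (the built-in session catalog).
    Table table = super.loadTable(ident);
    Map<String, String> props = table.properties();

    // Redirect based on a format marker in the table properties, similar to the
    // runtime redirects done by the Trino Hive connector.
    if ("ICEBERG".equalsIgnoreCase(props.get("table_type"))) {
      return icebergCatalog.loadTable(ident);
    }
    if ("delta".equalsIgnoreCase(props.get("provider"))) {
      return deltaCatalog.loadTable(ident);
    }

    // Plain Hive/Spark table: keep whatever the session catalog returned.
    return table;
  }
}
```
Registered as `spark_catalog`, a catalog along these lines would let existing two-part identifiers keep working while routing each table to the appropriate format-specific catalog.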