ericlgoodman commented on PR #5331:
URL: https://github.com/apache/iceberg/pull/5331#issuecomment-1201826444
Adding here my primary concern with this PR - and in general a concern going
forward with using multiple table formats such as Delta Lake and Iceberg in Spark.
Spark reads tables through whatever catalog is located at the first part of
a table's identifier. There can only be 1 catalog per identifier, and different
catalogs have different capabilities. For example, the `DeltaCatalog` can read
Delta Lake and generic Hive tables, and the `SparkSessionCatalog` can read
Iceberg + Hive tables.
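For concreteness, here is a minimal sketch of that resolution rule. The catalog name (`my_catalog`), implementation class, database, and table below are placeholders chosen for illustration, not anything defined in this PR; the point is only that the name after `spark.sql.catalog.` is what appears as the first part of a table identifier.
```java
import org.apache.spark.sql.SparkSession;

public class CatalogResolutionSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("catalog-resolution").getOrCreate();

    // Registering a catalog plugin under the name "my_catalog"...
    spark.conf().set("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.my_catalog.type", "hive");

    // ...makes it responsible for any identifier whose first part is "my_catalog";
    // the remaining parts (database/namespace and table) are resolved by that plugin.
    spark.sql("SELECT * FROM my_catalog.db.events").show();
  }
}
```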
In theory, in order to read from multiple table types in one Spark session,
a user would initialize a `DeltaCatalog` at, say, `delta` and then the
`SparkSessionCatalog` at `iceberg`. Then all their Delta Lake tables would be
located at `delta.my_delta_database.my_delta_lake_table` and all their Iceberg
tables at `iceberg.my_iceberg_database.my_iceberg_table`. Unfortunately, this
doesn't work out of the box. Both of these catalog implementations are designed
to be used by overriding the default Spark catalog, which is located at
`spark_catalog`. `CatalogExtension`, from which `DeltaCatalog` and
`SparkSessionCatalog` both inherit, contains a method
`setDelegateCatalog(CatalogPlugin delegate)`. As the Javadoc reads:
```java
/**
 * This will be called only once by Spark to pass in the Spark built-in session catalog, after
 * {@link #initialize(String, CaseInsensitiveStringMap)} is called.
*/
void setDelegateCatalog(CatalogPlugin delegate);
```
A user can work around this by manually calling this method during Spark setup
and passing Spark's built-in session catalog in as the delegate. But most
users presumably are not doing this, and some users might face difficulty
depending on their service provider and how much abstraction/configuration has
been taken away from them during setup.
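As a concrete illustration of that manual wiring, here is a minimal sketch. It relies on `sessionState()` and `catalogManager()`, which are Spark internals rather than stable public API, and the catalog names, implementation classes, and table identifiers are examples; whether this works end to end also depends on the specific catalog implementations.
```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.connector.catalog.CatalogExtension;
import org.apache.spark.sql.connector.catalog.CatalogPlugin;

public class MultiFormatCatalogSetup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("delta-and-iceberg")
        // Register the format-specific catalogs under non-default names,
        // leaving spark_catalog itself untouched.
        .config("spark.sql.catalog.delta", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkSessionCatalog")
        .config("spark.sql.catalog.iceberg.type", "hive")
        .getOrCreate();

    // Because neither catalog is registered as spark_catalog, Spark never calls
    // setDelegateCatalog() on them, so the delegate has to be wired in by hand
    // (the same would be needed for the delta catalog if it relies on a delegate).
    CatalogPlugin builtIn = spark.sessionState().catalogManager().catalog("spark_catalog");
    CatalogPlugin iceberg = spark.sessionState().catalogManager().catalog("iceberg");
    ((CatalogExtension) iceberg).setDelegateCatalog(builtIn);

    // Tables of both formats are now addressable from one session:
    spark.sql("SELECT * FROM delta.my_delta_database.my_delta_lake_table").show();
    spark.sql("SELECT * FROM iceberg.my_iceberg_database.my_iceberg_table").show();
  }
}
```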
This basically means that in today's world, users don't have a simple way to use
one Spark session to read from or migrate between different table formats.
Solving that might make sense to tackle first, as users may find that a
Delta/Iceberg/Hudi table makes sense for them in one context while a different
format is preferable in another.
When it comes to migration, there are basically two options:
1. Create a more abstract catalog implementation that can read
Iceberg/Delta/Hudi/Hive tables dynamically, similar to what happens in the
Trino Hive connector: the connector inspects the table properties and
determines at runtime whether to redirect to another connector. Similarly, a
Spark catalog could simply delegate to format-specific catalogs when it sees
certain format-specific table properties (a rough sketch of this idea follows
the list).
2. Provide an easier way for users to avoid having to override the default
catalog for these format-specific catalog implementations. If the Delta catalog
were located at `delta` and the Iceberg catalog at `iceberg`, then users could
simply keep their different table formats in different catalogs, and migration
could take an optional parameter for the desired target catalog.
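For option 1, here is a rough, non-authoritative sketch of what such a dispatching catalog could look like. It is not the actual Iceberg, Delta, or Trino implementation: the property names used as format markers (`table_type` = `ICEBERG` and `provider` = `delta`) and the way the format-specific catalogs are obtained are assumptions made purely for illustration.
```java
import java.util.Map;
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCatalog;

public class FormatDispatchingCatalog extends DelegatingCatalogExtension {

  // Format-specific catalogs; how these are configured and instantiated is
  // omitted from this sketch.
  private TableCatalog icebergCatalog;
  private TableCatalog deltaCatalog;

  @Override
  public Table loadTable(Identifier ident) throws NoSuchTableException {
    // Load the table metadata through the delegate (the built-in session catalog).
    Table table = super.loadTable(ident);
    Map<String, String> props = table.properties();

    // Redirect based on a format marker in the table properties, similar to the
    // runtime redirects done by the Trino Hive connector.
    if ("ICEBERG".equalsIgnoreCase(props.get("table_type"))) {
      return icebergCatalog.loadTable(ident);
    }
    if ("delta".equalsIgnoreCase(props.get("provider"))) {
      return deltaCatalog.loadTable(ident);
    }

    // Plain Hive/Spark table: keep whatever the session catalog returned.
    return table;
  }
}
```
Registered as `spark_catalog`, a catalog along these lines would let existing two-part identifiers keep working while routing each table to the appropriate format-specific catalog.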