aokolnychyi commented on pull request #1525:
URL: https://github.com/apache/iceberg/pull/1525#issuecomment-704410239
I did one pass through the code and have a few questions/notes:
**Iceberg vs Spark catalogs**
I think it is reasonable to use the Spark catalog API instead of the Iceberg
catalog API for the actual migration. I also like the idea of using staged
tables. At the same time, I am not sure `stageCreateOrReplace` has the right
semantics: I believe `replace` in Iceberg means replacing an Iceberg table with
another Iceberg table, so `stageCreate` sounds more appropriate in this case.
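To make the semantic difference concrete, here is a self-contained toy model of the two staging calls (the real interface is Spark's `StagingTableCatalog`; the classes below are illustrative stand-ins, not the actual Spark or Iceberg API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for Spark's TableAlreadyExistsException.
class TableAlreadyExistsException extends RuntimeException {
    TableAlreadyExistsException(String name) {
        super("Table already exists: " + name);
    }
}

// Toy catalog modeling the stageCreate vs stageCreateOrReplace distinction.
class ToyCatalog {
    private final Map<String, String> tables = new HashMap<>();

    void register(String name, String format) {
        tables.put(name, format);
    }

    // stageCreate: fails fast if the table already exists,
    // which is what a one-way SNAPSHOT/MIGRATE should want.
    String stageCreate(String name) {
        if (tables.containsKey(name)) {
            throw new TableAlreadyExistsException(name);
        }
        return "staged-create:" + name;
    }

    // stageCreateOrReplace: silently replaces an existing table,
    // which could hide an accidental overwrite during migration.
    String stageCreateOrReplace(String name) {
        return "staged-create-or-replace:" + name;
    }
}
```

Under this model, `stageCreate` rejects an existing target while `stageCreateOrReplace` quietly clobbers it, which is why the stricter call seems safer here.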
**CreateActions vs Actions**
I understand that this API targets Spark 3 while the Actions API is in the
common module. Would it make sense to keep `BaseActions` in the shared module
and have separate `Actions` in spark2 and spark3 modules? That way, we may add
static methods like `snapshot` and `migrate` to the `Actions` in spark3. In
spark2, there would be no extra methods.
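A minimal sketch of that layout (all class and method names here are assumptions for illustration, not the final API):

```java
// common module: shared plumbing available to every Spark version.
abstract class BaseActions {
    protected final String table;

    protected BaseActions(String table) {
        this.table = table;
    }

    public String table() {
        return table;
    }
    // shared actions such as expireSnapshots() / removeOrphanFiles() would live here
}

// spark3 module: extends the shared base and adds Spark 3-only entry points.
// The spark2 module would define its own Actions without snapshot/migrate.
final class Actions extends BaseActions {
    private Actions(String table) {
        super(table);
    }

    public static Actions forTable(String table) {
        return new Actions(table);
    }

    // static factories proposed above; strings stand in for real action objects
    public static String snapshot(String source, String dest) {
        return "snapshot " + source + " as " + dest;
    }

    public static String migrate(String source) {
        return "migrate " + source;
    }
}
```

The point is only that the shared base stays version-agnostic while each Spark module decides which entry points to expose.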
**Session Catalog**
Looks like this logic works only with the session catalog. That may be a
reasonable limitation, but I want to think through the use cases we want to
support, especially around path-based tables (#1306).
I consider this set of use cases the minimum we should support for the
`SNAPSHOT` and `MIGRATE` commands:
```
-- Case 1.1: snapshot a location using HadoopTables (or a path-based catalog) for target
SNAPSHOT TABLE parquet.`path/to/table`
USING iceberg
LOCATION `path/to/another/location`
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 1.2: snapshot a table in HMS using HiveCatalog for target
SNAPSHOT TABLE hive_table AS iceberg_hive.table_name
USING iceberg
LOCATION `path/to/another/location`
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 2.1: migrate a location using HadoopTables (or a path-based catalog) for target
MIGRATE TABLE parquet.`path/to/table`
USING iceberg
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 2.2: migrate a table in HMS, replacing the pointer in HMS
MIGRATE TABLE hive_table
USING iceberg
TBLPROPERTIES (
  'key' = 'value'
)
```
We need to clarify what our path-based integration will look like. @rdblue
@RussellSpitzer, will our existing Spark catalogs have references to
`HiveCatalog` as well as to `HadoopTables` or will there be a new Iceberg
catalog for path-based tables? If we go with the second option (separate
catalog), then we will not be able to use the same session catalog to migrate
both HMS-based and path-based tables. That seems like a substantial limitation.
I can see people using both in some cases. Maybe not too common?
I think we should clarify whether `MIGRATE` should be limited to the session
catalog only. If so, will we be able to migrate both path-based and metastore
tables using the same session catalog? If we want to support not only the
session catalog, we may want to reconsider the `MIGRATE t1 AS t2` syntax that
we initially discarded.
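For reference, the session catalog wiring looks roughly like this today; the `hadoop_cat` entry is a hypothetical illustration of a separate path-based catalog, not something this PR defines:

```
spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive

# a separate path-based catalog would need its own entry, e.g.:
spark.sql.catalog.hadoop_cat = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_cat.type = hadoop
spark.sql.catalog.hadoop_cat.warehouse = path/to/warehouse
```

If path-based tables end up in a catalog like `hadoop_cat` rather than in `spark_catalog`, a session-catalog-only `MIGRATE` could not reach them, which is the limitation raised above.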