aokolnychyi commented on pull request #1525:
URL: https://github.com/apache/iceberg/pull/1525#issuecomment-704410239
I did one pass through the code and have a few questions/notes:
**Iceberg vs Spark catalogs**
I think it is reasonable to use the Spark catalog API instead of the Iceberg
catalog API for the actual migration. I also like the idea of using staged
tables. At the same time, I am not sure `stageCreateOrReplace` has the right
semantics: I believe `replace` in Iceberg means replacing an Iceberg table with
another Iceberg table, so `stageCreate` sounds more appropriate in this case.
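To make the semantic difference concrete, here is a self-contained toy model of the two staging calls (the real interface is Spark's `StagingTableCatalog`; the classes below are illustrative stand-ins, not the actual Spark or Iceberg API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for Spark's TableAlreadyExistsException.
class TableAlreadyExistsException extends RuntimeException {
    TableAlreadyExistsException(String name) {
        super("Table already exists: " + name);
    }
}

// Toy catalog modeling the stageCreate vs stageCreateOrReplace distinction.
class ToyCatalog {
    private final Map<String, String> tables = new HashMap<>();

    void register(String name, String format) {
        tables.put(name, format);
    }

    // stageCreate: fails fast if the table already exists,
    // which is what a one-way SNAPSHOT/MIGRATE should want.
    String stageCreate(String name) {
        if (tables.containsKey(name)) {
            throw new TableAlreadyExistsException(name);
        }
        return "staged-create:" + name;
    }

    // stageCreateOrReplace: silently replaces an existing table,
    // which could hide an accidental overwrite during migration.
    String stageCreateOrReplace(String name) {
        return "staged-create-or-replace:" + name;
    }
}
```

Under this model, `stageCreate` rejects an existing target while `stageCreateOrReplace` quietly clobbers it, which is why the stricter call seems safer here.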
**CreateActions vs Actions**
I understand that this API targets Spark 3 while the Actions API is in the
common module. Would it make sense to keep `BaseActions` in the shared module
and have separate `Actions` in spark2 and spark3 modules? That way, we may add
static methods like `snapshot` and `migrate` to the `Actions` in spark3. In
spark2, there would be no extra methods.
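A minimal sketch of that layout (all class and method names here are assumptions for illustration, not the final API):

```java
// common module: shared plumbing available to every Spark version.
abstract class BaseActions {
    protected final String table;

    protected BaseActions(String table) {
        this.table = table;
    }

    public String table() {
        return table;
    }
    // shared actions such as expireSnapshots() / removeOrphanFiles() would live here
}

// spark3 module: extends the shared base and adds Spark 3-only entry points.
// The spark2 module would define its own Actions without snapshot/migrate.
final class Actions extends BaseActions {
    private Actions(String table) {
        super(table);
    }

    public static Actions forTable(String table) {
        return new Actions(table);
    }

    // static factories proposed above; strings stand in for real action objects
    public static String snapshot(String source, String dest) {
        return "snapshot " + source + " as " + dest;
    }

    public static String migrate(String source) {
        return "migrate " + source;
    }
}
```

The point is only that the shared base stays version-agnostic while each Spark module decides which entry points to expose.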
**Session Catalog**
Looks like this logic works only with the session catalog. That may be a
reasonable limitation, but I want to think through the use cases we want to
support, especially around path-based tables (#1306).
I consider this set of use cases the minimum we should support for the
`SNAPSHOT` and `MIGRATE` commands:
```
-- Case 1.1: snapshot a location using HadoopTables (or a path-based catalog) for target
SNAPSHOT TABLE parquet.`path/to/table`
USING iceberg
LOCATION `path/to/another/location`
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 1.2: snapshot a table in HMS using HiveCatalog for target
SNAPSHOT TABLE hive_table AS iceberg_hive.table_name
USING iceberg
LOCATION `path/to/another/location`
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 2.1: migrate a location using HadoopTables (or a path-based catalog) for target
MIGRATE TABLE parquet.`path/to/table`
USING iceberg
TBLPROPERTIES (
  'key' = 'value'
)

-- Case 2.2: migrate a table in HMS, replacing the pointer in HMS
MIGRATE TABLE hive_table
USING iceberg
TBLPROPERTIES (
  'key' = 'value'
)
```
We need to clarify what our path-based integration will look like. @rdblue
@RussellSpitzer, will our existing Spark catalogs have references to
`HiveCatalog` as well as to `HadoopTables` or will there be a new Iceberg
catalog for path-based tables? If we go with the second option (separate
catalog), then we will not be able to use the same session catalog to migrate
both HMS-based and path-based tables. That seems like a substantial limitation.
I can see people using both in some cases. Maybe not too common?
I think we should clarify whether `MIGRATE` should be limited to the session
catalog only. If so, will we be able to migrate both path-based and metastore
tables using the same session catalog? If we want to support not only the
session catalog, we may want to reconsider the `MIGRATE t1 AS t2` syntax that
we initially discarded.
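For reference, the session catalog wiring looks roughly like this today; the `hadoop_cat` entry is a hypothetical illustration of a separate path-based catalog, not something this PR defines:

```
spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive

# a separate path-based catalog would need its own entry, e.g.:
spark.sql.catalog.hadoop_cat = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_cat.type = hadoop
spark.sql.catalog.hadoop_cat.warehouse = path/to/warehouse
```

If path-based tables end up in a catalog like `hadoop_cat` rather than in `spark_catalog`, a session-catalog-only `MIGRATE` could not reach them, which is the limitation raised above.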