ashvina commented on code in PR #605:
URL: https://github.com/apache/incubator-xtable/pull/605#discussion_r1898160059
##########
rfc/rfc-1/rfc-1.md:
##########
@@ -0,0 +1,139 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-[1]: XCatalogSync - Synchronize tables across catalogs
+
+## Proposers
+
+- @vinishjail97
+
+## Approvers
+
+- Anyone from XTable community can approve/add feedback.
+
+## Status
+
+GH Feature Request: https://github.com/apache/incubator-xtable/issues/590
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Users of Apache XTable (Incubating) today can translate metadata across table formats (iceberg, hudi, and delta) and use the tables in different platforms depending on their choice.
+Today there's still some friction involved in terms of usability because users need to explicitly [register](https://xtable.apache.org/docs/catalogs-index) the tables in the catalog of their choice (glue, HMS, unity, bigLake etc.)

Review Comment:
Hi @vinishjail97, thanks for sharing the details. I think catalog sync is a useful feature. One of the key value-adds of catalogs is governance, particularly access control. All the catalogs mentioned here provide the ability to grant different privileges to roles. The proposed catalog sync in XTable replicates the table across catalogs. What are your thoughts about porting the governance features?

##########
rfc/rfc-1/rfc-1.md:
##########
+and then use the catalog in the platform of their choice to do DDL, DML queries.
+
+## Background
+XTable is built on the principle of omnidirectional interoperability, and I'm proposing an interface which allows syncing metadata of table formats to multiple catalogs in a continuous and incremental manner. With this new functionality we will be able to
+1. Reduce friction for XTable users - XTable sync will register the tables in the catalogs of their choice after metadata generation. If users are using a single format, they can still use XTable to sync the metadata across multiple catalogs.
+2. Avoid catalog lock-in - There's no reason why data/metadata in storage should be registered in a single catalog, users can register the table across multiple catalogs depending on the use-case, ecosystem and features provided by the catalog.
+
+## Implementation
+
+Introducing the following interfaces. [[PR]]( https://github.com/apache/incubator-xtable/pull/603)
+1. `CatalogSyncClient`: This interface contains methods that are responsible for creating table, refreshing table metadata, dropping table etc. in target catalog. Consider this interface as a translation layer between InternalTable and the catalog's table object.

Review Comment:
I don't think table DDL operations are related to InternalTable, but you bring up an important point. The table format and the catalog are two different layers in the analytics stack. Currently, XTable only supports conversion of table-format-level metadata, which is captured by the current InternalTable. The proposed feature, however, extends to the catalog layer, where catalog-level metadata translation takes place. So, in effect, this feature syncs an InternalCatalog object. What are your thoughts?

##########
rfc/rfc-1/rfc-1.md:
##########
+2. `CatalogSync`: This interface synchronizes the internal XTable object (InternalTable) to multiple target catalogs using the methods available in `CatalogSyncClient` interface.
+3. `CatalogTableIdentifier`: Represents a catalog table identifier in a multi-level catalog system. `HierarchicalTableIdentifier` is an internal representation of a fully qualified table identifier within a catalog following the three level hierarchy convention (it's used by all the major catalogs glue, hms, unity etc.). In the future, we can support other conventions by implementing this interface.
+
+For XTable users, defining their source/target catalog configurations and synchronizing tables will be handled by the `RunCatalogSync` class. This utility class parses the user’s YAML configuration, synchronizes table format metadata when necessary, and then uses the previously defined interfaces to synchronize the table in the catalog.
+[[PR]]( https://github.com/apache/incubator-xtable/pull/591)
+
+User's YAML configuration.
+1. `sourceCatalog`: Configuration of the source catalog from which XTable will read. It must contain all the necessary connection and access details for describing and listing tables.
+ 1. `catalogId`: A user-defined unique identifier for the catalog, allows user to sync table to multiple catalogs of the same name/type eg: HMS catalog with url1, HMS catalog with url2.
+ 2. `catalogType`: The type of the source catalog. This might be a specific type understood by XTable, such as Hive, Glue etc.
+ 3. `catalogSyncClientImpl`(optional): A fully qualified class name that implements the interface for `CatalogSyncClient`, it can be used if the implementation for catalogType doesn't exist in XTable.
+ 4. `catalogConversionSourceImpl`(optional): A fully qualified class name that implements the interface for `CatalogConversionSource`, it can be used if the implementation for catalogType doesn't exist in XTable.

Review Comment:
The difference between the roles of CatalogConversionSource and CatalogSyncClient is not clear. Could you please clarify? In the case of table sync, only TableSource exists; there is no TableClient.
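
For reference while discussing this, here is a minimal sketch of how the `sourceCatalog` fields quoted above might look in a user's YAML file. The nesting, the value formats, and both class names are assumptions made for illustration; only the key names are taken from the RFC text.

```yaml
# Hypothetical values throughout; only the key names come from the RFC text.
sourceCatalog:
  catalogId: "hms-url1"     # user-defined unique identifier (placeholder)
  catalogType: "HMS"        # assumed spelling of the type value
  # Optional overrides. Both class names are placeholders, not real XTable classes.
  catalogSyncClientImpl: "com.example.catalog.MyCatalogSyncClient"
  catalogConversionSourceImpl: "com.example.catalog.MyCatalogConversionSource"
```

In this layout both optional `Impl` keys sit side by side under the source catalog, so a note in the RFC on when each one is used would help.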