That's my expectation as well. Spark needs a reliable catalog. Backup/restore is just an implementation detail of how you make your catalog reliable, which should be transparent to Spark.
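For illustration, a minimal sketch of keeping the catalog reliable outside of Spark by pointing it at a shared, externally managed Hive metastore (the thrift URI and host below are placeholders, not a recommendation of a specific deployment); backups, replication, and HA then happen at the metastore database layer, invisibly to Spark:

```scala
import org.apache.spark.sql.SparkSession

// Point Spark at a shared, externally managed Hive metastore. Catalog
// durability (backups, replicas, DR) is then handled on the metastore's
// database, transparently to Spark. The URI is a placeholder.
val spark = SparkSession.builder()
  .appName("external-metastore-example")
  .config("hive.metastore.uris", "thrift://metastore.internal:9083")
  .enableHiveSupport()
  .getOrCreate()

// Catalog calls now read and write the external metastore.
spark.catalog.listDatabases().show(truncate = false)
```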
On Sat, May 8, 2021 at 6:54 AM ayan guha <guha.a...@gmail.com> wrote:

> Just a consideration:
>
> Is there value in backing up / restoring metadata within Spark? I would
> strongly argue that if the metadata is valuable enough and persistent
> enough, why not just use an external metastore? It is a fairly
> straightforward process. Also, regardless of whether you are in the cloud
> or not, database backup is a routine and established pattern in most
> organizations. You can also enhance HA and DR by having replicas across
> zones, regions, etc.
>
> Thoughts?
>
> On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <dustinzhang2...@gmail.com> wrote:
>
>> For now we are thinking about adding two methods to the Catalog API, not
>> SQL commands:
>> 1. spark.catalog.backup, which backs up the current catalog.
>> 2. spark.catalog.restore(file), which reads the DFS file and recreates
>> the entities described in that file.
>>
>> Can you please give an example of exposing client APIs to the end users
>> in this approach? The users can only call backup or restore, right?
>>
>> Thanks,
>> Tianchen
>>
>> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> If a catalog implements backup/restore, it can easily expose some client
>>> APIs to the end users (e.g. a REST API), so I don't see a strong reason
>>> to expose the APIs to Spark. Do you plan to add new SQL commands in
>>> Spark to back up/restore a catalog?
>>>
>>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <dustinzhang2...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Currently the user-facing Catalog API doesn't support backing up or
>>>> restoring metadata. Our customers are asking for such functionality.
>>>> Here is a usage example:
>>>> 1. Read all metadata of one Spark cluster.
>>>> 2. Save it into a Parquet file on DFS.
>>>> 3. Read the Parquet file and restore all metadata in another Spark
>>>> cluster.
>>>>
>>>> In the current implementation, the Catalog API has the list methods
>>>> (listDatabases, listFunctions, etc.), but they don't return enough
>>>> information to restore an entity (for example, listDatabases loses the
>>>> "properties" of a database, and we need "DESCRIBE DATABASE EXTENDED"
>>>> to get them). And it only supports createTable (no other entity
>>>> creations). The only way we can back up or restore an entity today is
>>>> through Spark SQL.
>>>>
>>>> We want to introduce backup and restore at the API level. We are
>>>> thinking of doing this simply by adding backup() and restore() in
>>>> CatalogImpl, as ExternalCatalog already includes all the methods we
>>>> need to retrieve and recreate entities. We are wondering if there are
>>>> any concerns or drawbacks to this approach. Please advise.
>>>>
>>>> Thank you in advance,
>>>> Tianchen
>>>
>
> --
> Best Regards,
> Ayan Guha
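For context, a rough sketch of the kind of metadata snapshot discussed in the thread, using only the existing public Catalog API plus SQL. The proposed spark.catalog.backup()/restore() methods do not exist in Spark today, and the output path below is illustrative; the snippet mainly shows the gap Tianchen describes, namely that listDatabases() alone does not return the database properties needed to recreate a database:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a metadata backup pass with existing public APIs only.
// "DESCRIBE DATABASE EXTENDED" is needed because the Database objects
// returned by listDatabases() omit the properties map.
val spark = SparkSession.builder()
  .appName("catalog-metadata-snapshot")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// One (database, info_name, info_value) row per DESCRIBE output line.
val dbMetadata = spark.catalog.listDatabases().collect().flatMap { db =>
  spark.sql(s"DESCRIBE DATABASE EXTENDED `${db.name}`")
    .collect()
    .map(row => (db.name, row.getString(0), row.getString(1)))
}

// Persist the snapshot to a Parquet file on DFS (path is a placeholder).
dbMetadata.toSeq.toDF("database", "info_name", "info_value")
  .write
  .mode("overwrite")
  .parquet("hdfs:///backups/catalog-metadata.parquet")
```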