That's my expectation as well. Spark needs a reliable catalog.
Backup/restore is just an implementation detail of how you make your
catalog reliable, and it should be transparent to Spark.

On Sat, May 8, 2021 at 6:54 AM ayan guha <guha.a...@gmail.com> wrote:

> Just a consideration:
>
> Is there value in backing up/restoring metadata within Spark? I would
> strongly argue: if the metadata is valuable enough and persistent enough,
> why not just use an external metastore? It is a fairly straightforward
> process. Also, regardless of whether you are in the cloud or not, database
> backup is a routine and established pattern in most organizations.
> You can also enhance HA and DR by having replicas across zones, regions,
> etc.
>
> Thoughts?
>
>
>
>
> On Sat, 8 May 2021 at 7:02 am, Tianchen Zhang <dustinzhang2...@gmail.com>
> wrote:
>
>> For now we are thinking about adding two methods to the Catalog API
>> (sketched below), not SQL commands:
>> 1. spark.catalog.backup, which backs up the current catalog.
>> 2. spark.catalog.restore(file), which reads the DFS file and recreates
>> the entities described in that file.
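>> Roughly, purely as an illustration (the names and signatures are not
>> final; backup would presumably also need a destination path):
>>
>>   // Hypothetical additions to the user-facing Catalog API (CatalogImpl).
>>   def backup(path: String): Unit    // write all catalog metadata to a file on DFS
>>   def restore(path: String): Unit   // read that file and recreate the entities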
>>
>> Can you please give an example of exposing client APIs to the end users
>> in this approach? The users can only call backup or restore, right?
>>
>> Thanks,
>> Tianchen
>>
>> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> If a catalog implements backup/restore, it can easily expose client
>>> APIs to the end-users (e.g. a REST API), so I don't see a strong reason
>>> to expose these APIs through Spark. Do you plan to add new SQL commands
>>> in Spark to back up/restore a catalog?
>>>
>>> On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang <dustinzhang2...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Currently the user-facing Catalog API doesn't support backing up or
>>>> restoring metadata. Our customers are asking for such functionality.
>>>> Here is a usage example (sketched in code after the list):
>>>> 1. Read all metadata from one Spark cluster
>>>> 2. Save it into a Parquet file on DFS
>>>> 3. Read the Parquet file and restore all the metadata in another Spark
>>>> cluster
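>>>>
>>>> Roughly, with hypothetical method names and a made-up DFS path:
>>>>
>>>>   // On cluster A: dump all catalog metadata to a shared DFS location.
>>>>   spark.catalog.backup("hdfs:///backups/catalog_metadata.parquet")
>>>>
>>>>   // On cluster B: read that file and recreate databases, tables, etc.
>>>>   spark.catalog.restore("hdfs:///backups/catalog_metadata.parquet")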
>>>>
>>>> In the current implementation, the Catalog API has list methods
>>>> (listDatabases, listFunctions, etc.), but they don't return enough
>>>> information to restore an entity (for example, listDatabases loses the
>>>> "properties" of a database, and we need "DESCRIBE DATABASE EXTENDED" to
>>>> get them). It also only supports createTable, not any other entity
>>>> creation. The only way we can back up or restore an entity today is
>>>> through Spark SQL.
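>>>>
>>>> For example, a backup today has to stitch together SQL output along
>>>> these lines (the database/table names below are placeholders):
>>>>
>>>>   // SQL-only "backup" today: collect textual output, then replay DDL later.
>>>>   val dbs     = spark.sql("SHOW DATABASES").collect()
>>>>   val dbProps = spark.sql("DESCRIBE DATABASE EXTENDED my_db").collect()
>>>>   val ddl     = spark.sql("SHOW CREATE TABLE my_db.my_table").collect()
>>>>   // ...then persist these results and re-run the DDL on the target cluster.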
>>>>
>>>> We want to introduce backup and restore at the API level. We are
>>>> thinking of doing this simply by adding backup() and restore() to
>>>> CatalogImpl, as ExternalCatalog already includes all the methods we need
>>>> to retrieve and recreate entities. We are wondering if there are any
>>>> concerns about or drawbacks to this approach. Please advise.
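>>>>
>>>> To make that concrete, here is a minimal sketch of the collection side,
>>>> assuming we can reach the session's ExternalCatalog from CatalogImpl
>>>> (functions, partitions, error handling and the Parquet serialization are
>>>> omitted; nothing here is final):
>>>>
>>>>   import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogTable, ExternalCatalog}
>>>>
>>>>   // Gather full database and table definitions so they can be recreated later.
>>>>   def collectMetadata(external: ExternalCatalog): (Seq[CatalogDatabase], Seq[CatalogTable]) = {
>>>>     val dbNames = external.listDatabases()
>>>>     val dbs     = dbNames.map(external.getDatabase)
>>>>     val tables  = dbNames.flatMap { db =>
>>>>       external.listTables(db).map(t => external.getTable(db, t))
>>>>     }
>>>>     (dbs, tables)
>>>>   }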
>>>>
>>>> Thank you in advance,
>>>> Tianchen
>>>>
>>> --
> Best Regards,
> Ayan Guha
>
