xushiyan commented on code in PR #6576:
URL: https://github.com/apache/hudi/pull/6576#discussion_r1024073357
########## rfc/rfc-61/rfc-61.md: ##########
@@ -0,0 +1,240 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-61: Snapshot view management
+
+
+## Proposers
+
+- @<proposer1 @fengjian428>
+
+## Approvers
+ - @<approver1 @xushiyan>
+ - @<approver2 @codope>
+
+## Status
+
+JIRA: [HUDI-4677](https://issues.apache.org/jira/browse/HUDI-4677)
+
+> Please keep the status updated in `rfc/README.md`.

Review Comment:
This line can be removed.

########## rfc/rfc-61/rfc-61.md: ##########
+## Abstract
+
+For the snapshot view scenario, Hudi already provides two key features to support it:
+* Time travel: user provides a timestamp to query a specific snapshot view of a Hudi table
+* Savepoint/restore: "savepoint" saves the table as of the commit time so that it lets you restore the table to this savepoint at a later point in time if need be.
+but in this case, the user usually uses this to prevent cleaning snapshot view at a specific timestamp, hence, only clean unused files
+The situation is there some inconvenience for users if they use them directly
+
+Usually users incline to use a meaningful name instead of querying Hudi table with a timestamp, using the timestamp in SQL may lead to the wrong snapshot view being used.
+for example, we can announce that a new tag of hudi table with table_nameYYYYMMDD was released, then the user can use this new table name to query.
+Savepoint is not designed for this "snapshot view" scenario in the beginning, it is designed for disaster recovery.
+let's say a new snapshot view will be created every day, and it has 7 days retention, we should support lifecycle management on top of it.
+What this RFC plan to do is to let Hudi support release a snapshot view and lifecycle management out-of-box.
+
+## Background
+Introduce any much background context which is relevant or necessary to understand the feature and design choices.
+typical scenarios and benefits of snapshot view:
+1. Basic idea:

Review Comment:
Putting newlines before and after the bullet points will fix the formatting.

########## rfc/rfc-61/rfc-61.md: ##########
+* Savepoint/restore: "savepoint" saves the table as of the commit time so that it lets you restore the table to this savepoint at a later point in time if need be.
+but in this case, the user usually uses this to prevent cleaning snapshot view at a specific timestamp, hence, only clean unused files
+The situation is there some inconvenience for users if they use them directly

Review Comment:
I don't quite get the logic behind these 2 sentences... 1) People use savepoint for disaster recovery, not for preventing cleaning; the latter is the end result, not the purpose. 2) Not quite sure what the "inconvenience" and "use directly" mean exactly.

########## rfc/rfc-61/rfc-61.md: ##########
+Create Snapshot view based on Hudi Savepoint
+ * Create Snapshot views periodically by time(date time/processing time)
+ * Use External Metastore(such as HMS) to store external view
+
+Build periodic snapshots based on the time period required by the user
+These Shapshots are stored as partitions in the metadata management system
+Users can easily use SQL to access this data in Flink Spark or Presto.
+Because the data store is complete and has no merged details,
+So the data itself is to support the full amount of data calculation, also support incremental processing
+
+2. Compare to Hive solution
+
+
+The Snapshot view is created based on Hudi Savepoint, which significantly reduces the data storage space of some large tables
+ * The space usage becomes (1 + (t-1) * p)/t
+ * Incremental use reduces the amount of data involved in the calculation

Review Comment:
Same formatting issue.
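The space-usage formula quoted in this hunk, (1 + (t-1) * p)/t, can be sanity-checked with a short sketch (illustrative only; the function name and sample values are not from the RFC):

```python
def snapshot_storage_ratio(p: float, t: int) -> float:
    """Storage of t savepoint-based snapshots relative to t full copies.

    p: proportion of data changed between consecutive snapshots
    t: number of retained time periods
    One full copy plus (t - 1) deltas of relative size p, spread over
    t full-copy equivalents: (1 + (t - 1) * p) / t.
    """
    return (1 + (t - 1) * p) / t

# 7 daily snapshots with 10% of the data changing per day need only
# about 23% of the space that 7 full Hive-style copies would take.
print(f"{snapshot_storage_ratio(0.1, 7):.3f}")
```

As the quoted text says, the lower the churn `p` and the longer the retention `t`, the closer the ratio gets to `1/t`.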
########## rfc/rfc-61/rfc-61.md: ##########
+1. Basic idea:
+

Review Comment:
```suggestion

```
Use a relative path to show the image in the PR itself.

########## rfc/rfc-61/rfc-61.md: ##########
+Because the data store is complete and has no merged details,
+So the data itself is to support the full amount of data calculation, also support incremental processing

Review Comment:
How would this incremental processing of snapshot views differ from the existing incremental processing of the Hudi table itself? Is it intended for a bigger incremental pull window? If this understanding is correct, then in between the snapshots there will be missing original commits with changed data. Not sure how practical this is.
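The reviewer's concern about missing commits between snapshots can be illustrated with a toy sketch (not Hudi code; all names are invented): a diff between two snapshots only sees net changes, collapsing intermediate versions that per-commit incremental pull would deliver.

```python
# Three commits land between two daily snapshots; each commit is a
# key -> value upsert on the table.
commits = [
    {"id1": "v1"},   # commit 1 inserts id1=v1
    {"id1": "v2"},   # commit 2 updates id1 to v2
    {"id2": "v1"},   # commit 3 inserts id2
]

def state_after(n: int) -> dict:
    """Table state after replaying the first n commits."""
    state: dict = {}
    for c in commits[:n]:
        state.update(c)
    return state

# Per-commit incremental pull sees every change record (3 of them)...
per_commit_changes = sum(len(c) for c in commits)

# ...but a snapshot-to-snapshot diff only sees the net result (2 rows);
# the intermediate id1=v1 version is invisible between snapshots.
before, after = state_after(0), state_after(3)
snapshot_diff = {k: v for k, v in after.items() if before.get(k) != v}

print(per_commit_changes, sorted(snapshot_diff))
```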
########## rfc/rfc-61/rfc-61.md: ##########
+Create Snapshot view based on Hudi Savepoint
+ * Create Snapshot views periodically by time(date time/processing time)
+ * Use External Metastore(such as HMS) to store external view

Review Comment:
The format is broken. Please verify locally and add newlines where applicable.

########## rfc/rfc-61/rfc-61.md: ##########
+Build periodic snapshots based on the time period required by the user
+These Shapshots are stored as partitions in the metadata management system

Review Comment:
I don't quite get the "stored as partitions" - do you mean each snapshot's info is saved in a partition of some metadata table in the catalog?
########## rfc/rfc-61/rfc-61.md: ##########
+When using snapshot view storage, for some scenarios where the proportion of changing data is not large, a better storage saving effect will be achieved
+We have a simple formula here to calculate the effect
+P indicates the proportion of changed data, and t indicates the number of time periods to be saved
+The lower the percentage of changing data, the better the storage savings
+So There is also a good savings for long periods of data
+
+At the same time, it has benefit for incremental computing resource saving
+
+3. Some typical scenarios
+   1. Every day generate a new snapshot base on original Hudi table which named tbl-YYYYMMDD, user can use snapshot table to generate derived tables,
+      provide report data. if user's downstream calculation logic changed, can choose relevant snapshot to re-process.
+      user also can set retain days as X day, clean out-of-date data automatically. SCD-2 should also can be achieved here.
+   2. One archived branch named yyyy-archived can be generated after compress and optimize. if our retention policy has been
+      changed(let's say remove some sensitive information), then can generate a new snapshot base on this branch after operation done.
+   3. One Snapshot named pre-prod can release to customer after some quality validations passed base on any external tools.
+
+## Implementation
+
+### Extend Savepoint meta
+Snapshot view need to extend the savepoint metadata, so we are going to add one struct with four fields:
+* tag_name: tag name for your snapshot
+* retain-days: number of day, So data belongs to this snapshot will be retained for retain-days, then can be clean after snapshot expire
+* database: database name in Catalog
+* table-name: table name in Catalog
+
+new Savepoint Metadata should look like below:
+``` diff
+{
+  "type": "record",
+  "name": "HoodieSavepointMetadata",
+  "namespace": "org.apache.hudi.avro.model",
+  "fields": [{
+    "name": "savepointedBy",
+    "type": {
+      "type": "string",
+      "avro.java.string": "String"
+    }
+  }, {
+    "name": "savepointedAt",
+    "type": "long"
+  }, {
+    "name": "comments",
+    "type": {
+      "type": "string",
+      "avro.java.string": "String"
+    }
+  }, {
+    "name": "partitionMetadata",
+    "type": {
+      "type": "map",
+      "values": {
+        "type": "record",
+        "name": "HoodieSavepointPartitionMetadata",
+        "fields": [{
+          "name": "partitionPath",
+          "type": {
+            "type": "string",
+            "avro.java.string": "String"
+          }
+        }, {
+          "name": "savepointDataFile",
+          "type": {
+            "type": "array",
+            "items": {
+              "type": "string",
+              "avro.java.string": "String"
+            }
+          }
+        }]
+      },
+      "avro.java.string": "String"
+    }
+  }, {
+    "name": "version",
+    "type": ["int", "null"],
+    "default": 1
++ }, {
++   "name": "tag",

Review Comment:
I don't think it needs a nested structure here. Its properties are fine to stay at the top level.
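For illustration, the flat layout the reviewer suggests could look like the following in the same schema (a sketch only, not the final design; the nullable unions with `null` defaults are an assumption for backward-compatible schema evolution, and the field names mirror the four proposed properties):

``` diff
  }, {
    "name": "version",
    "type": ["int", "null"],
    "default": 1
+ }, {
+   "name": "tagName",
+   "type": ["null", "string"],
+   "default": null
+ }, {
+   "name": "retainDays",
+   "type": ["null", "int"],
+   "default": null
+ }, {
+   "name": "database",
+   "type": ["null", "string"],
+   "default": null
+ }, {
+   "name": "tableName",
+   "type": ["null", "string"],
+   "default": null
  }]
```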
See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +# RFC-61: Snapshot view management + + +## Proposers + +- @<proposer1 @fengjian428> + +## Approvers + - @<approver1 @xushiyan> + - @<approver2 @codope> + +## Status + +JIRA: [HUDI-4677](https://issues.apache.org/jira/browse/HUDI-4677) + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +For the snapshot view scenario, Hudi already provides two key features to support it: +* Time travel: user provides a timestamp to query a specific snapshot view of a Hudi table +* Savepoint/restore: "savepoint" saves the table as of the commit time so that it lets you restore the table to this savepoint at a later point in time if need be. +but in this case, the user usually uses this to prevent cleaning snapshot view at a specific timestamp, hence, only clean unused files +The situation is there some inconvenience for users if they use them directly + +Usually users incline to use a meaningful name instead of querying Hudi table with a timestamp, using the timestamp in SQL may lead to the wrong snapshot view being used. +for example, we can announce that a new tag of hudi table with table_nameYYYYMMDD was released, then the user can use this new table name to query. +Savepoint is not designed for this "snapshot view" scenario in the beginning, it is designed for disaster recovery. 
+let's say a new snapshot view will be created every day, and it has 7 days retention, we should support lifecycle management on top of it. +What this RFC plan to do is to let Hudi support release a snapshot view and lifecycle management out-of-box. + +## Background +Introduce any much background context which is relevant or necessary to understand the feature and design choices. +typical scenarios and benefits of snapshot view: +1. Basic idea: + + +Create Snapshot view based on Hudi Savepoint + * Create Snapshot views periodically by time(date time/processing time) + * Use External Metastore(such as HMS) to store external view + +Build periodic snapshots based on the time period required by the user +These Shapshots are stored as partitions in the metadata management system +Users can easily use SQL to access this data in Flink Spark or Presto. +Because the data store is complete and has no merged details, +So the data itself is to support the full amount of data calculation, also support incremental processing + +2. Compare to Hive solution + + +The Snapshot view is created based on Hudi Savepoint, which significantly reduces the data storage space of some large tables + * The space usage becomes (1 + (t-1) * p)/t + * Incremental use reduces the amount of data involved in the calculation + +When using snapshot view storage, for some scenarios where the proportion of changing data is not large, a better storage saving effect will be achieved +We have a simple formula here to calculate the effect +P indicates the proportion of changed data, and t indicates the number of time periods to be saved +The lower the percentage of changing data, the better the storage savings +So There is also a good savings for long periods of data + +At the same time, it has benefit for incremental computing resource saving + +3. Some typical scenarios + 1. 
Every day generate a new snapshot base on original Hudi table which named tbl-YYYYMMDD, user can use snapshot table to generate derived tables, + provide report data. if user's downstream calculation logic changed, can choose relevant snapshot to re-process. + user also can set retain days as X day, clean out-of-date data automatically. SCD-2 should also can be achieved here. + 2. One archived branch named yyyy-archived can be generated after compress and optimize. if our retention policy has been + changed(let's say remove some sensitive information), then can generate a new snapshot base on this branch after operation done. + 3. One Snapshot named pre-prod can release to customer after some quality validations passed base on any external tools. + +## Implementation + + + +### Extend Savepoint meta +Snapshot view need to extend the savepoint metadata, so we are going to add one struct with four fields: +* tag_name: tag name for your snapshot +* retain-days: number of day, So data belongs to this snapshot will be retained for retain-days, then can be clean after snapshot expire +* database: database name in Catalog +* table-name: table name in Catalog + +new Savepoint Metadata should look like below: +``` diff +{ + "type": "record", + "name": "HoodieSavepointMetadata", + "namespace": "org.apache.hudi.avro.model", + "fields": [{ + "name": "savepointedBy", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "savepointedAt", + "type": "long" + }, { + "name": "comments", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "partitionMetadata", + "type": { + "type": "map", + "values": { + "type": "record", + "name": "HoodieSavepointPartitionMetadata", + "fields": [{ + "name": "partitionPath", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "savepointDataFile", + "type": { + "type": "array", + "items": { + "type": "string", + "avro.java.string": "String" + } + } + }] + }, + 
+      "avro.java.string": "String"
+    }
+  }, {
+    "name": "version",
+    "type": ["int", "null"],
+    "default": 1
++  }, {
++    "name": "tag",
++    "type": "record",
++    "fields": [{
++      "name": "tag_name",
++      "type": "string",
++      "avro.java.string": "String"
++    },
++    {
++      "name": "retain_days",

Review Comment:
   recording expiryTimestamp at the time of generating savepoint works better
########## rfc/rfc-61/rfc-61.md:
##########
@@ -0,0 +1,240 @@
++    "fields": [{
++      "name": "tag_name",

Review Comment:
   should follow existing naming convention camelCase

########## rfc/rfc-61/rfc-61.md:
##########
@@ -0,0 +1,240 @@
++  }, {
++    "name": "tag",
++    "type": "record",
++    "fields": [{
++      "name": "tag_name",
++      "type": "string",
++      "avro.java.string": "String"
++    },
++    {
++      "name": "retain_days",
++      "type": ["int", "null"],
++      "default": 0
++    },
++    {
++      "name": "database",
++      "type": "string",
++      "avro.java.string": "String"
++    },
++    {
++      "name": "table_name", // defaults to tag_name's value if not specified
++      "type": "string",
++      "avro.java.string": "String"
++    },
++    ]
++  }
++  ]
+}
+```
+
+### Meta Sync
+Creating a snapshot view also creates a new external table in the Catalog, and adds a timestamp to the table properties to identify which savepoint it uses.
+For example, if you choose Hive Metastore as the catalog and create a snapshot view on a Hudi table, the following steps are processed:
+* create a new savepoint with a tag name (let's say tbl-YYYYMMDD)
+* create an external table in HMS; the table's name is tbl-YYYYMMDD
+* add as.of.instant=${savepoint's timestamp} into table tbl-YYYYMMDD's storage properties
+
+When a user queries such a snapshot's external table, engines like Spark/Presto get the savepoint timestamp from the external table's properties and pass it back to Hudi for time travel.
+
+### Clean service
+A normal savepoint will never be cleaned by the clean service, but a tagged savepoint is cleanable since it can become out-of-date.
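In HMS terms, the meta sync flow above roughly amounts to the following DDL and time-travel lookup. This is a sketch only — the exact DDL issued by Hudi's sync tool differs, and the table name, columns, location, and instant below are made-up values:

```sql
-- Illustrative only: table name, columns, location, and instant are made up.
CREATE EXTERNAL TABLE tbl_20220901 (id BIGINT, name STRING)
  STORED AS PARQUET
  LOCATION '/warehouse/hudi_table'            -- same basepath as the original Hudi table
  TBLPROPERTIES ('as.of.instant' = '20220901000000');

-- When an engine reads tbl_20220901, the property is translated into a
-- time-travel read against the original table, conceptually equivalent to:
SELECT * FROM hudi_table TIMESTAMP AS OF '20220901000000';
```

The key point is that the snapshot table carries no data of its own; only the `as.of.instant` property in HMS pins which file slices the engine resolves.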
+
+### Operations
+* Create Snapshot View
+  Create a savepoint on a specific commit; meanwhile, create a new external table named tablename_YYYYMMDD and add as.of.instant=${savepoint's timestamp} into the external table's storage properties. This table has the same basepath as the original Hudi table.
+```sql
+call create_snapshot_view(table => 'hudi_table', commit_time => 'commit_timestamp_from_timeline', snapshot_table => 'snapshot_hive_table');
+```
+
+| Parameter Name   | Required | Default                       | Remarks                                                                                            |
+|------------------|----------|-------------------------------|----------------------------------------------------------------------------------------------------|
+| `table`          | `true`   | `--`                          | the Hive table name on which to create the savepoint; must be a Hudi table, without database name  |
+| `commit_time`    | `false`  | `None`                        | the commit timestamp from the Hudi timeline; if not provided, the latest commit is used            |
+| `user`           | `false`  | `""`                          | the user name saved in the Hudi savepoint metadata                                                 |
+| `comments`       | `false`  | `""`                          | the comment saved in the Hudi savepoint metadata                                                   |
+| `snapshot_table` | `false`  | `$table name + _$commit_time` | the snapshot view table name in Hive                                                               |
+| `hms`            | `false`  | `None`                        | Hive metastore server used for syncing savepoint information                                       |
+
+* Delete Snapshot View
+  Call the delete savepoint command (via spark-sql or hudi-cli); meanwhile, delete the associated Hive table
+```sql
+call delete_snapshot_view(table => 'hudi_table', snapshot_table => 'snapshot_hive_table');
+```
+
+| Parameter Name | Required | Default | Remarks |
+| -------------- | -------- | ------- | ------- |
+| `table`        | `true`   | `--`    | the Hive table name whose savepoint you want to delete; must be a Hudi table |
+| `instant_time` | `true`   | `--`    | the savepoint timestamp from the Hudi timeline; must be provided |
+| `hive_sync`    | `false`  | `false` | whether to delete the savepoint timestamp from the Hive table serde properties |
+| `hms`          | `false`  | `None`  | Hive metastore server used for deleting savepoint information |
+
+* List Snapshot View
+```sql
+call show_snapshotviews(table => 'hudi_table');
+```
+
+| Parameter Name | Required | Default | Remarks |
+| -------------- | -------- | ------- | ------- |
+| `table`        | `true`   | `--`    | the Hive table name whose savepoint list you want to show; must be a Hudi table |
+
+### Mor support
+Savepoint already supports Merge-On-Read tables.
+
+### Precise Event time Snapshot On Merge-On-Read table
+
+handle drifted data issue

Review Comment:
   can you elaborate?
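Putting the proposed procedures together, a daily snapshot lifecycle might look like the following. Parameter values are illustrative, and the procedure names/signatures are those proposed in this RFC rather than shipped APIs:

```sql
-- Tag today's state of the table as a named snapshot view.
call create_snapshot_view(table => 'hudi_table', snapshot_table => 'hudi_table_20220901');

-- Inspect the existing snapshot views for the table.
call show_snapshotviews(table => 'hudi_table');

-- Drop the snapshot view (and its associated Hive table) once it is no longer needed.
call delete_snapshot_view(table => 'hudi_table', snapshot_table => 'hudi_table_20220901');
```

If `snapshot_table` is omitted on creation, the name would default to `$table name + _$commit_time` per the parameter table above.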
########## rfc/rfc-61/rfc-61.md:
##########
@@ -0,0 +1,240 @@
+### Clean service
+Normal savepoint will never be cleaned in Clean service, but a tagged savepoint is cleanable since it could be out-of-date.

Review Comment:
   no need to distinguish normal vs tagged savepoint. just let cleaner check expiryTimestamp and then decide
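The reviewer's suggestion can be illustrated as follows: instead of the cleaner re-deriving expiry from `retain_days` on every run, an absolute expiry instant is computed once when the savepoint is created. The table and column names below are hypothetical, purely for illustration:

```sql
-- Hypothetical: expiry recorded at savepoint-creation time.
-- savepointed_at and expiry_timestamp are epoch milliseconds;
-- 86400000 = milliseconds per day.
SELECT savepointed_at + retain_days * 86400000 AS expiry_timestamp
FROM   savepoint_metadata;

-- The cleaner then needs only a comparison against the current time:
-- a savepoint becomes cleanable when expiry_timestamp < now.
```

This removes any need to special-case tagged vs. untagged savepoints: an untagged savepoint simply has no (or an infinite) expiry.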
+ +## Abstract + +For the snapshot view scenario, Hudi already provides two key features to support it: +* Time travel: user provides a timestamp to query a specific snapshot view of a Hudi table +* Savepoint/restore: "savepoint" saves the table as of the commit time so that it lets you restore the table to this savepoint at a later point in time if need be. +but in this case, the user usually uses this to prevent cleaning snapshot view at a specific timestamp, hence, only clean unused files +The situation is there some inconvenience for users if they use them directly + +Usually users incline to use a meaningful name instead of querying Hudi table with a timestamp, using the timestamp in SQL may lead to the wrong snapshot view being used. +for example, we can announce that a new tag of hudi table with table_nameYYYYMMDD was released, then the user can use this new table name to query. +Savepoint is not designed for this "snapshot view" scenario in the beginning, it is designed for disaster recovery. +let's say a new snapshot view will be created every day, and it has 7 days retention, we should support lifecycle management on top of it. +What this RFC plan to do is to let Hudi support release a snapshot view and lifecycle management out-of-box. + +## Background +Introduce any much background context which is relevant or necessary to understand the feature and design choices. +typical scenarios and benefits of snapshot view: +1. Basic idea: + + +Create Snapshot view based on Hudi Savepoint + * Create Snapshot views periodically by time(date time/processing time) + * Use External Metastore(such as HMS) to store external view + +Build periodic snapshots based on the time period required by the user +These Shapshots are stored as partitions in the metadata management system +Users can easily use SQL to access this data in Flink Spark or Presto. 
+Because the data store is complete and has no merged details, +So the data itself is to support the full amount of data calculation, also support incremental processing + +2. Compare to Hive solution + + +The Snapshot view is created based on Hudi Savepoint, which significantly reduces the data storage space of some large tables + * The space usage becomes (1 + (t-1) * p)/t + * Incremental use reduces the amount of data involved in the calculation + +When using snapshot view storage, for some scenarios where the proportion of changing data is not large, a better storage saving effect will be achieved +We have a simple formula here to calculate the effect +P indicates the proportion of changed data, and t indicates the number of time periods to be saved +The lower the percentage of changing data, the better the storage savings +So There is also a good savings for long periods of data + +At the same time, it has benefit for incremental computing resource saving + +3. Some typical scenarios + 1. Every day generate a new snapshot base on original Hudi table which named tbl-YYYYMMDD, user can use snapshot table to generate derived tables, + provide report data. if user's downstream calculation logic changed, can choose relevant snapshot to re-process. + user also can set retain days as X day, clean out-of-date data automatically. SCD-2 should also can be achieved here. + 2. One archived branch named yyyy-archived can be generated after compress and optimize. if our retention policy has been + changed(let's say remove some sensitive information), then can generate a new snapshot base on this branch after operation done. + 3. One Snapshot named pre-prod can release to customer after some quality validations passed base on any external tools. 
+ +## Implementation + + + +### Extend Savepoint meta +Snapshot view need to extend the savepoint metadata, so we are going to add one struct with four fields: +* tag_name: tag name for your snapshot +* retain-days: number of day, So data belongs to this snapshot will be retained for retain-days, then can be clean after snapshot expire +* database: database name in Catalog +* table-name: table name in Catalog + +new Savepoint Metadata should look like below: +``` diff +{ + "type": "record", + "name": "HoodieSavepointMetadata", + "namespace": "org.apache.hudi.avro.model", + "fields": [{ + "name": "savepointedBy", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "savepointedAt", + "type": "long" + }, { + "name": "comments", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "partitionMetadata", + "type": { + "type": "map", + "values": { + "type": "record", + "name": "HoodieSavepointPartitionMetadata", + "fields": [{ + "name": "partitionPath", + "type": { + "type": "string", + "avro.java.string": "String" + } + }, { + "name": "savepointDataFile", + "type": { + "type": "array", + "items": { + "type": "string", + "avro.java.string": "String" + } + } + }] + }, + "avro.java.string": "String" + } + }, { + "name": "version", + "type": ["int", "null"], + "default": 1 ++ }, { ++ "name": "tag", ++ "type": "record", ++ "fields": [{ ++ "name": "tag_name", ++ "type": "string", ++ "avro.java.string": "String" ++ }, ++ { ++ "name": "retain_days", ++ "type": ["int", "null"], ++ "default": 0 ++ }, ++ { ++ "name": "database", ++ "type": "string", ++ "avro.java.string": "String" ++ }, ++ { ++ "name": "table_name", // will infer tag_name's value if doesn't specific ++ "type": "string", ++ "avro.java.string": "String" ++ }, ++ ] ++ } ++ ] +} +``` + +### Meta Sync +Create a snapshot view also will create a new external table into the Catalogs, and add a timestamp into tbl properties to identify which savepoint you are 
using. +for example, if you choose Hive Metastore as the catalog and create a snapshot view on a Hudi table, following steps will be process: +* create a new savepoint with tag name (let's say tbl-YYYYMMDD) +* create an external table in HMS, table's name is tbl-YYYYMMDD +* add as.of.instant= ${savepoint's timestamp} into table tbl-YYYYMMDD's storage properties , + +when user query such a snapshot's external table, engines like Spark/Presto will get the savepoint timestamp from external table's properties then pass back to Hudi for time travel + +### Clean service +Normal savepoint will never be cleaned in Clean service, but a tagged savepoint is cleanable since it could be out-of-date. + +### Operations +* Create Snapshot View + create savepoint on a specific commit, meanwhile, create a new external table name tablename_YYYYMMDD, add as.of.instant={savepoint's timestamp} into external table storage properties , this table has the same basepath with the original Hudi table +```sql +call create_snapshot_view(table => 'hudi_table', commit_Time => 'commit_timestamp_from_timeline', snapshot_table => 'snapshot_hive_table'); +``` + +| Parameter Name | Required | Default | Remarks | +|------------------| ------- |-------------------------------|-----------------------------------------------------------------------------------------------| +| `table` | `true` | `--` | the Hive table name you want to create savepoint, must be a Hudi table, without database name | +| `commit_Time` | `false` | `None` | the commit timestamp from Hudi timeline, if not provided, will use the latest commit | +| `user` | `false` | `""` | the user name will be saved in Hudi savepoint metadata | +| `comments` | `false` | `""` | the comment will be saved in Hudi savepoint metadata | +| `snapshot_table` | `false` | `$table name + _$commit_time` | the snapshot view table name in hive | +| `hms` | `false` | `None` | Hive metastore server used for syncing savepoint information | + +* Delete Snapshot 
View
+  Call the delete savepoint command (via spark-sql or hudi-cli); meanwhile, delete the associated Hive table.
+```sql
+call delete_snapshot_view(table => 'hudi_table', snapshot_table => 'snapshot_hive_table');
+```
+
+| Parameter Name | Required | Default | Remarks |
+| -------------- | -------- | ------- | ------- |
+| `table`        | `true`   | `--`    | the Hive table name to delete the savepoint from; must be a Hudi table |
+| `instant_time` | `true`   | `--`    | the savepoint timestamp from the Hudi timeline; must be provided |
+| `hive_sync`    | `false`  | `false` | whether to delete the savepoint timestamp from the Hive table serde properties |
+| `hms`          | `false`  | `None`  | the Hive metastore server used for deleting savepoint information |
+
+* List Snapshot View
+```sql
+call show_snapshotviews(table => 'hudi_table');
+```
+
+| Parameter Name | Required | Default | Remarks |
+| -------------- | -------- | ------- | ------- |
+| `table`        | `true`   | `--`    | the Hive table name to show the savepoint list for; must be a Hudi table |
+
+### MOR support
+Savepoint already supports Merge-On-Read tables.

Review Comment: don't feel it's needed calling out specific for MOR here. We can call out if there is special handling done for MOR.

########## rfc/rfc-61/rfc-61.md: ##########

+## Abstract
+
+For the snapshot view scenario, Hudi already provides two key features to support it:
+* Time travel: the user provides a timestamp to query a specific snapshot view of a Hudi table
+* Savepoint/restore: a savepoint saves the table as of the commit time so that you can restore the table to this savepoint at a later point in time if need be. In this case, however, users usually use it to prevent the snapshot view at a specific timestamp from being cleaned, i.e. to clean only unused files.
+There is some inconvenience for users if they use these features directly.
+
+Usually users prefer a meaningful name over querying a Hudi table with a timestamp; using the timestamp in SQL may lead to the wrong snapshot view being used.
+For example, we can announce that a new tag of a Hudi table named table_nameYYYYMMDD was released, and the user can then query using this new table name.
+Savepoint was not designed for this "snapshot view" scenario in the beginning; it was designed for disaster recovery.
+Let's say a new snapshot view is created every day with 7 days' retention: we should support lifecycle management on top of it.
+What this RFC plans to do is let Hudi support releasing a snapshot view and lifecycle management out of the box.
+
+## Background
+Typical scenarios and benefits of snapshot view:
+1.
Basic idea:
+
+Create snapshot views based on Hudi savepoints:
+  * Create snapshot views periodically by time (date time / processing time)
+  * Use an external metastore (such as HMS) to store the external view
+
+Build periodic snapshots based on the time period required by the user.
+These snapshots are stored as partitions in the metadata management system.
+Users can easily use SQL to access this data in Flink, Spark, or Presto.
+Because the stored data is complete and keeps unmerged details,
+it supports both full-data computation and incremental processing.
+
+2. Comparison to the Hive solution
+
+The snapshot view is created based on a Hudi savepoint, which significantly reduces the data storage space of some large tables:
+  * The space usage becomes (1 + (t-1) * p) / t
+  * Incremental use reduces the amount of data involved in the computation
+
+When using snapshot view storage, scenarios where the proportion of changing data is small achieve a better storage-saving effect.
+We have a simple formula to calculate the effect:
+p indicates the proportion of changed data, and t indicates the number of time periods to be saved.
+The lower the percentage of changing data, the better the storage savings,
+so there is also a good saving for long periods of data.
+
+At the same time, it benefits incremental computing by saving resources.
+
+3. Some typical scenarios
+  1. Every day, generate a new snapshot named tbl-YYYYMMDD based on the original Hudi table; users can use the snapshot table to generate derived tables and provide report data. If a user's downstream calculation logic changes, they can choose the relevant snapshot to re-process. Users can also set the retention to X days to clean out-of-date data automatically. SCD-2 should also be achievable here.
+  2. An archived branch named yyyy-archived can be generated after compaction and optimization.
+     If our retention policy changes (let's say some sensitive information must be removed), a new snapshot can be generated from this branch after the operation is done.
+  3. A snapshot named pre-prod can be released to customers after some quality validations pass, based on any external tools.
+
+### Precise Event time Snapshot On Merge-On-Read table
+
+Handle the drifted-data issue.
+
+## Rollout/Adoption Plan
+There should be no impact on existing users.

Review Comment: This section should clarify how the feature is to be rolled out. What will happen if users enable or disable it

########## rfc/rfc-61/rfc-61.md: ##########

+## Test Plan
+
+Describe in few sentences how the RFC will be tested. How will we know that the implementation works as expected? How will we know nothing broke?.

Review Comment: pls fill up this section too

########## rfc/rfc-61/rfc-61.md: ##########

+When using snapshot view storage, for some scenarios where the proportion of changing data is not large, a better storage saving effect will be achieved

Review Comment: savepoint commit will record all base files at that point of time and those files will be retained in the hudi table. so it's still the full data at that point. what storage saving is this compared against?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
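The space-usage formula discussed in the thread, (1 + (t-1) * p) / t, can be made concrete with a short sketch. This is illustrative Python only, not part of the proposal or the Hudi codebase; the function name is hypothetical, and it assumes the Hive-style baseline of keeping t independent full copies while snapshot views share unchanged files:

```python
def relative_space_usage(t: int, p: float) -> float:
    """Space used by t snapshot views that share unchanged files,
    relative to storing t full copies of the table.

    t: number of retained time periods (snapshots), t >= 1
    p: proportion of data that changes per period, in [0, 1]
    """
    if t < 1 or not 0.0 <= p <= 1.0:
        raise ValueError("t must be >= 1 and p must be in [0, 1]")
    # One full copy for the first snapshot, plus only the changed
    # fraction p for each of the remaining t-1 periods, divided by
    # the t full copies the Hive-style solution would keep.
    return (1 + (t - 1) * p) / t

# 7 daily snapshots with 10% daily churn use (1 + 6*0.1)/7 of the
# space of 7 full copies.
print(round(relative_space_usage(7, 0.10), 4))  # → 0.2286
```

Note the two boundary cases: with t = 1 or with p = 1 (everything changes each period) the formula evaluates to 1, i.e. no saving over full copies, which matches the RFC's claim that the saving comes from a small proportion of changing data over many periods.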
