[ https://issues.apache.org/jira/browse/HUDI-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149186#comment-17149186 ]
vinoyang edited comment on HUDI-480 at 7/1/20, 7:28 AM:
--------------------------------------------------------

[~vinoth] For hard deletion, can we log the row key list in the metadata of the commit?

was (Author: yanghua):
For hard deletion, can we log the row key list in the metadata of the commit?

> Support a querying delete data method in incremental view
> ----------------------------------------------------------
>
>                 Key: HUDI-480
>                 URL: https://issues.apache.org/jira/browse/HUDI-480
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Incremental Pull
>            Reporter: cdmikechen
>            Priority: Minor
>
> As we know, Hudi supports several ways to query data from Spark, Hive, and
> Presto. Its timeline also makes it possible to trace changes in the data,
> and the incremental view uses it to query incremental data.
> Previously we only had insert and update functions to upsert data; now we
> have added new functions to delete existing data.
> *[HUDI-328] Adding delete api to HoodieWriteClient*
> https://github.com/apache/incubator-hudi/pull/1004
> *[HUDI-377] Adding Delete() support to DeltaStreamer*
> https://github.com/apache/incubator-hudi/pull/1073
> Now that we have a delete API, should we add another method to get deleted
> data in the incremental view?
> I've looked at how new parquet files are generated. The main idea is to
> combine old and new data and then filter out the records that need to be
> deleted, so that the deleted data does not exist in the new dataset.
> As a result, deleted records are not retained in the new dataset, so when
> tracing data in the incremental view, only inserted or modified records can
> be found through the existing timestamp field.
> If we can do it, I see two ideas to consider:
> 1. Trace the dataset in the same file at different checkpoints on the
> timeline, compare the two datasets by key, and filter out the deleted data.
> This approach adds no extra cost at write time, but the comparison has to
> run at query time for each request, which is expensive.
> 2. When writing data, record any deleted data in a file named something
> like *.delete_filename_version_timestamp*. Queries can then return the
> deleted data for a given time immediately, but the write path has to do
> additional processing.
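
The two ideas above can be sketched in a few lines of plain Python. This is only an illustration of the trade-off, not Hudi code: `Snapshot`, `deleted_keys_by_diff`, and `DeleteLog` are hypothetical names, snapshots are modeled as in-memory dicts keyed by record key, and commit times are assumed to be timestamp strings that sort lexicographically.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical model: one file version = mapping of record key -> row.
Snapshot = Dict[str, dict]


def deleted_keys_by_diff(older: Snapshot, newer: Snapshot) -> Set[str]:
    """Idea 1: compare two versions of the same file at query time.

    A key present in the older version but missing from the newer one
    was deleted in between. Nothing extra is written, but every query
    has to load and compare both versions.
    """
    return set(older) - set(newer)


class DeleteLog:
    """Idea 2: record deleted keys at write time (hypothetical helper).

    Each entry pairs a commit time with the keys deleted by that commit,
    similar in spirit to logging the row key list in commit metadata.
    """

    def __init__(self) -> None:
        self._entries: List[Tuple[str, frozenset]] = []

    def record(self, commit_time: str, keys: Iterable[str]) -> None:
        key_set = frozenset(keys)
        if key_set:  # only commits that actually deleted something
            self._entries.append((commit_time, key_set))

    def deleted_between(self, begin: str, end: str) -> Set[str]:
        """Keys deleted by commits with begin < commit_time <= end."""
        result: Set[str] = set()
        for commit_time, keys in self._entries:
            if begin < commit_time <= end:
                result |= keys
        return result


if __name__ == "__main__":
    v1 = {"k1": {"val": 1}, "k2": {"val": 2}, "k3": {"val": 3}}
    v2 = {"k1": {"val": 1}, "k3": {"val": 30}, "k4": {"val": 4}}
    print(deleted_keys_by_diff(v1, v2))

    log = DeleteLog()
    log.record("20200701072800", ["k2"])
    print(log.deleted_between("20200701000000", "20200701235959"))
```

The diff needs both file versions at query time, while the log answers from a small side structure that the writer has to maintain on every delete.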