cdmikechen created HUDI-480:
-------------------------------
Summary: Support a method for querying deleted data in the incremental
view
Key: HUDI-480
URL: https://issues.apache.org/jira/browse/HUDI-480
Project: Apache Hudi (incubating)
Issue Type: Bug
Components: Incremental Pull
Reporter: cdmikechen
As we know, Hudi supports several ways to query data through Spark, Hive, and
Presto. It also provides a timeline that traces changes to the data, which can
be used to query incremental data in the incremental view.
Previously, we only had insert and update functions to upsert data; we have now
added new functions to delete existing data.
*[HUDI-328] Adding delete api to HoodieWriteClient*
https://github.com/apache/incubator-hudi/pull/1004
*[HUDI-377] Adding Delete() support to DeltaStreamer*
https://github.com/apache/incubator-hudi/pull/1073
So, now that we have a delete API, should we add another method to get deleted
data in the incremental view?
I've looked at the methods that generate new parquet files. I think the main
idea is to combine the old and new data and then filter out the records that
need to be deleted, so that they do not exist in the new dataset. However,
because the deleted records are not retained in the new dataset, only inserted
or modified data can be found via the existing timestamp field when tracing
data in the incremental view.
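To make the problem concrete, here is a minimal sketch in plain Python. It is a toy model of the behaviour described above, not Hudi's actual merge code: the functions `merge_file_slice` and `incremental_view` and the `key` and `commit_time` fields are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List, Set

Record = Dict[str, object]


def merge_file_slice(old: Iterable[Record], upserts: Iterable[Record],
                     deleted_keys: Set[str], commit_time: str) -> List[Record]:
    """Toy merge: combine old and new records, then drop deleted keys."""
    merged: Dict[object, Record] = {r["key"]: r for r in old}
    for r in upserts:
        # Inserted or updated records are stamped with the new commit time.
        merged[r["key"]] = {**r, "commit_time": commit_time}
    # Deleted records are filtered out and leave no trace in the new file.
    return [r for k, r in merged.items() if k not in deleted_keys]


def incremental_view(records: Iterable[Record], begin_time: str) -> List[Record]:
    """Toy incremental query: records written after begin_time."""
    return [r for r in records if r["commit_time"] > begin_time]


old = [
    {"key": "a", "value": 1, "commit_time": "001"},
    {"key": "b", "value": 2, "commit_time": "001"},
    {"key": "c", "value": 3, "commit_time": "001"},
]
new = merge_file_slice(old, upserts=[{"key": "a", "value": 10}],
                       deleted_keys={"b"}, commit_time="002")
changed = incremental_view(new, begin_time="001")
print([r["key"] for r in changed])
```

In this toy run the incremental query reports only the updated key `a`. The deletion of `b` is invisible, because no record carrying the new commit time exists for it.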
If we do this, I see two approaches to consider:
1. Trace the dataset in the same file at different checkpoints on the
timeline, compare the two datasets by key, and filter out the deleted data.
This method adds no extra cost when writing, but it has to run the comparison
for each request at query time, which is expensive.
2. When writing data, record any deleted data, for example in a file named like
*.delete_filename_version_timestamp*. Queries can then return the deleted data
immediately for a given time range, at the cost of additional processing at
write time.
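The trade-off between the two ideas can be sketched as follows. This is only an illustration in plain Python under simplifying assumptions (records held in memory, commit times as sortable strings); `deleted_by_snapshot_diff` and `DeleteLog` are hypothetical names and do not correspond to any existing Hudi API.

```python
from typing import Dict, List, Set

Record = Dict[str, object]


def deleted_by_snapshot_diff(before: List[Record], after: List[Record]) -> Set[object]:
    """Idea 1: keys present at the earlier checkpoint but absent at the later one.

    Nothing extra is written, but both snapshots must be read at query time.
    """
    return {r["key"] for r in before} - {r["key"] for r in after}


class DeleteLog:
    """Idea 2: record deleted keys per commit at write time."""

    def __init__(self) -> None:
        self._by_commit: Dict[str, Set[str]] = {}

    def record(self, commit_time: str, keys: Set[str]) -> None:
        # Extra work on the write path: persist which keys this commit deleted.
        self._by_commit.setdefault(commit_time, set()).update(keys)

    def deleted_since(self, begin_time: str) -> Set[str]:
        # Cheap at query time: only the recorded delete sets are consulted.
        result: Set[str] = set()
        for commit_time, keys in self._by_commit.items():
            if commit_time > begin_time:
                result |= keys
        return result


before = [{"key": "a"}, {"key": "b"}, {"key": "c"}]
after = [{"key": "a"}, {"key": "c"}]
diff_result = deleted_by_snapshot_diff(before, after)

log = DeleteLog()
log.record("002", {"b"})
log.record("003", {"c"})
log_result = log.deleted_since("002")
```

The first function pays its cost on every query, since it has to load and compare two full snapshots. The second pays on every write that contains deletes, but a query only needs to scan the recorded delete sets for the requested time range.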
--
This message was sent by Atlassian Jira
(v8.3.4#803005)