[
https://issues.apache.org/jira/browse/HUDI-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153241#comment-17153241
]
vinoyang commented on HUDI-480:
-------------------------------
Yes, your proposal provides a differentiated implementation of this feature for
the different storage views.
My initial idea went a somewhat different way, which differs from yours in two
key points:
1) At the feature level, both views (COW, MOR) should be supported. If the
feature bifurcates between the two views, it may give users a bad experience in
the future: they would have to pick a view based on which features it supports.
A better experience is to let users choose the view according to their own
scenario, with each view providing similar functionality.
2) At the implementation level, the two views should be consistent, similar to
our design of the index for the two views. We may introduce external storage to
hold this "metadata" (row keys), so that it is more likely to be implemented as
a plug-in: the metadata storage backend can be replaced, and it also serves as
a lightweight metadata system for Hudi.
The above were some of my earlier thoughts; my understanding of Hudi's core
implementation may not be thorough enough.
Of course, I also agree with focusing directly on Hudi's own storage and views,
which avoids introducing too many external dependencies. But we still need to
provide implementations for the two different storage views.
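To make point 2 concrete, here is a minimal Python sketch of what a pluggable
row-key metadata store could look like. This is not Hudi code; the class and
method names are hypothetical, and a real backend would be something like
HBase or RocksDB rather than an in-memory dict.

```python
from abc import ABC, abstractmethod


class RowKeyMetadataStore(ABC):
    """Hypothetical plug-in interface for per-commit row-key metadata."""

    @abstractmethod
    def record_deletes(self, commit_time: str, keys: set) -> None:
        """Persist the record keys deleted in a given commit."""

    @abstractmethod
    def deleted_keys(self, begin_time: str, end_time: str) -> set:
        """Return keys deleted in commits within (begin_time, end_time]."""


class InMemoryRowKeyMetadataStore(RowKeyMetadataStore):
    """Toy in-memory backend, for illustration only."""

    def __init__(self):
        self._deletes = {}  # commit_time -> set of deleted keys

    def record_deletes(self, commit_time, keys):
        self._deletes.setdefault(commit_time, set()).update(keys)

    def deleted_keys(self, begin_time, end_time):
        # Commit times sort lexicographically, so string comparison works
        # for timeline-style timestamps.
        return {
            key
            for commit, keys in self._deletes.items()
            if begin_time < commit <= end_time
            for key in keys
        }
```

Replacing the storage framework then only means supplying another
implementation of the interface, which is the plug-in idea above.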
> Support a method for querying deleted data in the incremental view
> ------------------------------------------------------------------
>
> Key: HUDI-480
> URL: https://issues.apache.org/jira/browse/HUDI-480
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Incremental Pull
> Reporter: cdmikechen
> Priority: Minor
>
> As we know, Hudi supports many methods to query data in Spark, Hive, and
> Presto. It also provides a very good timeline concept to trace changes in
> data, which can be used to query incremental data in the incremental view.
> Previously we only had insert and update functions to upsert data, and now
> we have added new functions to delete existing data.
> *[HUDI-328] Adding delete api to HoodieWriteClient*
> https://github.com/apache/incubator-hudi/pull/1004
> *[HUDI-377] Adding Delete() support to DeltaStreamer*
> https://github.com/apache/incubator-hudi/pull/1073
> So I think: now that we have a delete API, should we add a method to get
> deleted data in the incremental view?
> I've looked at the methods for generating new parquet files. The main idea
> is to combine the old and new data and filter out the records that need to
> be deleted, so that the deleted data does not exist in the new dataset.
> However, this means the deleted data is not retained in the new dataset, so
> during data tracing in the incremental view only inserted or modified data
> can be found via the existing timestamp field.
> If we want to support this, I see two ideas to consider:
> 1. Trace the dataset in the same file at two different time checkpoints
> according to the timeline, compare the two datasets by key, and filter out
> the deleted records. This method adds no overhead when writing, but the
> comparison must run for each actual query, which is expensive at query
> time.
> 2. When writing data, record any deleted records in a file named, for
> example, *.delete_filename_version_timestamp*, so that we can answer the
> query immediately according to the time. But this adds extra processing at
> write time.
>
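Idea 1 in the quoted description (diffing two checkpoints by record key) can
be sketched outside Hudi as a plain key-set comparison. This is an
illustration only, not Hudi's actual implementation; it assumes each snapshot
is an iterable of dict-like records carrying the record key (in Spark this
would instead be a left-anti join of the two DataFrames on the key column).

```python
def diff_deleted_keys(old_snapshot, new_snapshot,
                      key_field="_hoodie_record_key"):
    """Return the record keys present at the older checkpoint but absent at
    the newer one, i.e. the records deleted between the two commits."""
    old_keys = {row[key_field] for row in old_snapshot}
    new_keys = {row[key_field] for row in new_snapshot}
    # Keys that existed before but no longer exist were deleted.
    return old_keys - new_keys
```

Note the cost this implies: both snapshots must be read and their keys
compared at query time, which is exactly the query-side overhead the
description warns about.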
--
This message was sent by Atlassian Jira
(v8.3.4#803005)