[ 
https://issues.apache.org/jira/browse/HUDI-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507460#comment-17507460
 ] 

Nie Gus commented on HUDI-480:
------------------------------

[~vinoth]  [~chenxiang] 

We also face a similar issue. Our job's business logic needs each record's
current and previous values, so the developer can compare the two and then
branch into different logic. We are seriously considering the idea you
mentioned in your top comment on this Jira:

<<<<<<<<<<<<<<

Do you think this is specific to deletes? Could we generalize this to a new
config, say 'include.before.image=true', in incremental pull, where you get
two values per record? Currently, you only get one value per record
upserted/deleted.

||Operation||include.before.image=false||include.before.image=true||
|insert|new_value_inserted|[null, new_value_inserted]|
|update/soft delete|new_value_updated|[old_value, new_value]|
|hard delete|May not get anything today.|[deleted_value, null]|

<<<<<<<<<<<<<<<<<<<<<<<<
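The pairing described in the table above can be sketched in a few lines of plain Python. This is only an illustration of the proposed semantics, not Hudi's actual API; `before_image`, the key/value shapes, and the example values are all hypothetical.

```python
def before_image(prev_snapshot, changes):
    """Pair each incremental change with its previous value.

    prev_snapshot: dict mapping record key -> value as of the last pull.
    changes: list of (key, new_value) tuples; new_value None means hard delete.
    Returns dict mapping key -> [old_value, new_value], per the table above.
    """
    out = {}
    for key, new_value in changes:
        old_value = prev_snapshot.get(key)  # None for a fresh insert
        out[key] = [old_value, new_value]
    return out

# Example: one insert, one update, one hard delete.
prev = {"k2": "old_v2", "k3": "v3"}
changes = [("k1", "new_v1"), ("k2", "new_v2"), ("k3", None)]
print(before_image(prev, changes))
# {'k1': [None, 'new_v1'], 'k2': ['old_v2', 'new_v2'], 'k3': ['v3', None]}
```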

We are considering developing a UDF to support this, like below:

select status, pre_value(status) as pre_status
from table_xxxx
where _hoodie_commit_time > xxxx;

 

Because we haven't figured out how to present the "old_value, new_value" result
in a single SQL column... any thoughts? Or is there any ongoing feature that
can meet our requirement? Thanks.
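One way to sidestep the single-column problem is to surface the previous value as its own flat column, which is effectively what the proposed pre_value UDF does. A toy sketch in plain Python (hypothetical names throughout; a real implementation would resolve the previous value from the prior commit's file slice, not from an in-memory dict):

```python
def with_pre_value(prev_snapshot, incremental_rows, column):
    """Emit (current, previous) pairs for one column, mimicking
    `select status, pre_value(status) as pre_status`.

    prev_snapshot: dict mapping record key -> row dict as of the previous commit.
    incremental_rows: list of row dicts, each carrying a '_key' field.
    """
    for row in incremental_rows:
        prev_row = prev_snapshot.get(row["_key"], {})
        # None pre-value for rows that did not exist before (inserts).
        yield row[column], prev_row.get(column)

prev = {"a": {"status": "PENDING"}}
rows = [{"_key": "a", "status": "DONE"}, {"_key": "b", "status": "NEW"}]
print(list(with_pre_value(prev, rows, "status")))
# [('DONE', 'PENDING'), ('NEW', None)]
```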

 

> Support a method for querying deleted data in the incremental view
> ------------------------------------------------------------------
>
>                 Key: HUDI-480
>                 URL: https://issues.apache.org/jira/browse/HUDI-480
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: incremental-query
>            Reporter: cdmikechen
>            Assignee: cdmikechen
>            Priority: Minor
>
> As we know, Hudi supports many methods to query data in Spark, Hive, and
> Presto. It also provides a very good timeline concept to trace changes in
> data, which can be used to query incremental data in the incremental view.
> Previously, we only had insert and update functions to upsert data, and now
> we have added new functions to delete existing data.
> *[HUDI-328] Adding delete api to HoodieWriteClient*
> https://github.com/apache/incubator-hudi/pull/1004
> *[HUDI-377] Adding Delete() support to DeltaStreamer*
> https://github.com/apache/incubator-hudi/pull/1073
> So I think, since we have a delete API, should we add another method to get
> deleted data in the incremental view?
> I've looked at the methods for generating new parquet files. The main idea
> is to combine old and new data and then filter out the data that needs to be
> deleted, so that the deleted data does not exist in the new dataset.
> However, this means the deleted data is not retained in the new dataset, so
> only inserted or modified data can be found via the existing timestamp field
> when tracing data in the incremental view.
> If we do this, I feel there are two ideas to consider:
> 1. Trace the dataset in the same file at different time checkpoints
> according to the timeline, compare the two datasets by key, and filter out
> the deleted data. This method adds no extra cost when writing, but it needs
> to run the comparison for each request at query time, which is expensive.
> 2. When writing data, record any deleted data in a file named like
> *.delete_filename_version_timestamp*, so that we can immediately return the
> deleted records for a given time. But this adds extra processing at write
> time.
>  
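Idea 1 above — comparing two snapshots of the same file group by key — could be sketched like this (plain Python with hypothetical names; a real implementation would read the file slices as of two instants on the timeline rather than take dicts):

```python
def diff_snapshots(snapshot_t1, snapshot_t2):
    """Compare two key->value snapshots taken at timeline instants t1 < t2.

    Returns (inserted, updated, deleted), where `deleted` holds the records
    that exist at t1 but are gone at t2 -- the data idea 1 wants to surface.
    """
    inserted = {k: v for k, v in snapshot_t2.items() if k not in snapshot_t1}
    updated = {k: (snapshot_t1[k], v) for k, v in snapshot_t2.items()
               if k in snapshot_t1 and snapshot_t1[k] != v}
    deleted = {k: v for k, v in snapshot_t1.items() if k not in snapshot_t2}
    return inserted, updated, deleted

# k4 was inserted, k2 was updated, and k3 was hard-deleted between t1 and t2.
t1 = {"k1": "a", "k2": "b", "k3": "c"}
t2 = {"k1": "a", "k2": "b2", "k4": "d"}
print(diff_snapshots(t1, t2))
# ({'k4': 'd'}, {'k2': ('b', 'b2')}, {'k3': 'c'})
```

As the issue notes, the trade-off is that this comparison runs per query, whereas idea 2 pays the cost once at write time.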



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
