cdmikechen opened a new issue #1140: A suggestion to support querying 
deleted data in incremental view
URL: https://github.com/apache/incubator-hudi/issues/1140
 
 
   As we know, Hudi already supports several ways to query data from Spark, 
Hive, and Presto. It also provides a very useful timeline concept for tracing 
changes in data, which can be used to query incremental data in the 
incremental view.
   Previously we only had insert and update functions to upsert data; now 
we have added new functions to delete existing data.
   
   **[HUDI-328] Adding delete api to HoodieWriteClient**   
https://github.com/apache/incubator-hudi/pull/1004
   **[HUDI-377] Adding Delete() support to DeltaStreamer**   
https://github.com/apache/incubator-hudi/pull/1073
   
   So, now that we have a delete API, should we add another method to get 
deleted data in the incremental view?
   
   I've looked at the methods that generate new parquet files. I think the 
main idea is to combine the old and new data and then filter out the records 
that need to be deleted, so that the deleted data does not exist in the new 
dataset. However, because the deleted records are not retained in the new 
dataset, only inserted or modified data can be found through the existing 
timestamp field when tracing data in the incremental view.
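
   To make the gap concrete, here is a minimal sketch in plain Python (not Hudi code). The `merge_file` and `incremental_view` helpers, the `key` field, and the short commit times are hypothetical; `_hoodie_commit_time` is assumed to be the "existing timestamp field" mentioned above. It only illustrates why a deleted key leaves no trace for a timestamp-based incremental read.

```python
from typing import Dict, Iterable, List, Set

Record = Dict[str, str]


def merge_file(old_records: Iterable[Record],
               upserts: Iterable[Record],
               deleted_keys: Set[str],
               commit_time: str) -> List[Record]:
    """Hypothetical rewrite of one file: apply upserts, then drop deleted keys."""
    merged = {r["key"]: r for r in old_records}
    for r in upserts:
        # Inserted or updated records are stamped with the new commit time.
        merged[r["key"]] = {**r, "_hoodie_commit_time": commit_time}
    # Deleted records are simply filtered out; nothing records that they existed.
    return [r for k, r in merged.items() if k not in deleted_keys]


def incremental_view(records: Iterable[Record], begin_time: str) -> List[Record]:
    """Hypothetical incremental read: keep records committed after begin_time."""
    return [r for r in records if r["_hoodie_commit_time"] > begin_time]


old_file = [
    {"key": "a", "value": "1", "_hoodie_commit_time": "001"},
    {"key": "b", "value": "2", "_hoodie_commit_time": "001"},
]
new_file = merge_file(old_file, [{"key": "c", "value": "3"}], {"b"}, "002")
changes = incremental_view(new_file, "001")
```

   In this sketch `changes` contains only the record for key `c`; the deletion of key `b` cannot be observed from the new file alone.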
   If we decide to do this, I see two approaches to consider:
   1. Read the same file at two different checkpoints on the timeline, 
compare the two datasets by key, and filter out the deleted records. This 
approach adds no extra cost at write time, but the comparison has to run for 
each request at query time, which is expensive.
   2. When writing data, record any deleted records in a separate file, 
named for example `.delete_filename_version_timestamp`. A query can then 
return the deleted data for a given time immediately, at the cost of 
additional processing at write time.
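
   The two approaches can be sketched roughly as follows, again in plain Python rather than Hudi code. All function names, the `key` field, and the JSON marker content are assumptions made for illustration; only the `.delete_filename_version_timestamp` naming pattern comes from the proposal above.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Set

Record = Dict[str, str]


def deleted_keys_by_diff(old_records: Iterable[Record],
                         new_records: Iterable[Record],
                         key_field: str = "key") -> Set[str]:
    """Approach 1: compare two versions of a file at query time."""
    new_keys = {r[key_field] for r in new_records}
    return {r[key_field] for r in old_records if r[key_field] not in new_keys}


def write_delete_marker(directory: Path, file_name: str, version: int,
                        commit_time: str, deleted_keys: Iterable[str]) -> Path:
    """Approach 2 (write side): record deleted keys next to the data file."""
    marker = directory / f".delete_{file_name}_{version}_{commit_time}"
    marker.write_text(json.dumps(sorted(deleted_keys)))
    return marker


def deleted_keys_since(directory: Path, begin_time: str) -> Set[str]:
    """Approach 2 (read side): collect keys from markers newer than begin_time."""
    found: Set[str] = set()
    for marker in directory.glob(".delete_*"):
        # The commit time is assumed to be the last underscore-separated part.
        commit_time = marker.name.rsplit("_", 1)[1]
        if commit_time > begin_time:
            found.update(json.loads(marker.read_text()))
    return found
```

   The trade-off is visible in the sketch: `deleted_keys_by_diff` needs both file versions loaded for every query, while `deleted_keys_since` only reads small marker files but requires the writer to call `write_delete_marker` on every commit that deletes data.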
   
   

