[
https://issues.apache.org/jira/browse/HUDI-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213671#comment-17213671
]
cdmikechen edited comment on HUDI-480 at 10/14/20, 7:04 AM:
------------------------------------------------------------
[~yanghua]
In step 1, we just filter out the files that contain deleted rows, using the
recorded delete count to check them. I've written some code like this:
{code:java}
HoodieInstant instant = timeline.filter(i -> i.getTimestamp().equals(commitTime))
    .firstInstant().get();
HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
    timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
  for (HoodieWriteStat stat : stats) {
    // keep only files whose write stats recorded deletes against a previous commit
    if (stat.getPrevCommit() != null && stat.getNumDeletes() > 0) {
      LOG.info("file name is {} in {} with partition {}, and prev commit is {}",
          stat.getFileId(), stat.getPath(), stat.getPartitionPath(), stat.getPrevCommit());
    }
  }
}
{code}
In step 2, we can take the files filtered out in step 1 and collect the
previous file versions by prev commit.
Step 1 and step 2 can be implemented together in one method.
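Combined, the two steps could look roughly like the sketch below. This is only a minimal illustration with simplified stand-in types; `WriteStat` and `collectByPrevCommit` are hypothetical names, not Hudi APIs, and a real implementation would operate on `HoodieWriteStat` as in the snippet above.

```java
import java.util.*;
import java.util.stream.*;

// Minimal stand-in for HoodieWriteStat (hypothetical, for illustration only).
class WriteStat {
    final String fileId;
    final String prevCommit;
    final long numDeletes;

    WriteStat(String fileId, String prevCommit, long numDeletes) {
        this.fileId = fileId;
        this.prevCommit = prevCommit;
        this.numDeletes = numDeletes;
    }
}

public class DeleteFileCollector {
    // Step 1: keep only stats that recorded deletes against a previous commit.
    // Step 2: group the surviving file ids by that previous commit.
    static Map<String, List<String>> collectByPrevCommit(Collection<List<WriteStat>> partitionStats) {
        return partitionStats.stream()
            .flatMap(List::stream)
            .filter(s -> s.prevCommit != null && s.numDeletes > 0)
            .collect(Collectors.groupingBy(
                s -> s.prevCommit,
                Collectors.mapping(s -> s.fileId, Collectors.toList())));
    }
}
```

The returned map can then drive step 2: read each previous file version and compare it against the current one.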
was (Author: chenxiang):
> Support a method for querying deleted data in incremental view
> --------------------------------------------------------------
>
> Key: HUDI-480
> URL: https://issues.apache.org/jira/browse/HUDI-480
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Incremental Pull
> Reporter: cdmikechen
> Assignee: cdmikechen
> Priority: Minor
>
> As we know, Hudi supports many methods to query data in Spark, Hive and
> Presto. It also provides a very good timeline idea for tracing changes in
> data, which can be used to query incremental data in the incremental view.
> Previously we only had insert and update functions to upsert data; now we
> have added new functions to delete existing data.
> *[HUDI-328] Adding delete api to HoodieWriteClient*
> https://github.com/apache/incubator-hudi/pull/1004
> *[HUDI-377] Adding Delete() support to DeltaStreamer*
> https://github.com/apache/incubator-hudi/pull/1073
> So, since we have a delete API, should we add another method to get deleted
> data in the incremental view?
> I've looked at the methods for generating new parquet files. The main idea
> is to combine old and new data and then filter out the records that need to
> be deleted, so that the deleted records do not exist in the new dataset.
> However, this way the deleted data is not retained in the new dataset, so
> only inserted or modified data can be found by the existing timestamp field
> when tracing data in the incremental view.
> If we want to do this, I feel there are two ideas to consider:
> 1. Trace the dataset in the same file at two different checkpoints on the
> timeline, compare the two datasets by record key and filter out the deleted
> data. This method adds no extra cost when writing, but it must run the
> comparison for each actual request at query time, which is expensive.
> 2. When writing data, record any deleted data in a file named e.g.
> *.delete_filename_version_timestamp*, so that we can immediately answer the
> query by time. But this does additional processing at write time.
>
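The key comparison in the first idea quoted above could be sketched as a simple set difference between the record keys of two snapshots. `SnapshotDiff` and `deletedKeys` are hypothetical names for illustration, not existing Hudi APIs:

```java
import java.util.*;

public class SnapshotDiff {
    // Idea 1: records present at the older checkpoint but absent at the
    // newer one must have been deleted between the two commits.
    static Set<String> deletedKeys(Set<String> keysAtOldCommit, Set<String> keysAtNewCommit) {
        Set<String> deleted = new HashSet<>(keysAtOldCommit);
        deleted.removeAll(keysAtNewCommit);
        return deleted;
    }
}
```

As the issue notes, this costs nothing at write time but requires reading and comparing both snapshots on every query.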
--
This message was sent by Atlassian Jira
(v8.3.4#803005)