[ 
https://issues.apache.org/jira/browse/HUDI-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pramod Biligiri updated HUDI-4994:
----------------------------------
    Description: 
Datahub supports soft deletes: the entity remains in the database, but its
"status" aspect is set to removed:true. Such an entity can be re-ingested with
new properties at a later time, overwriting the older ones. The current
implementation of DatahubSyncTool does not handle this scenario: it does not
reset the status flag to removed:false during ingest, so the re-ingested
entity stays hidden in the Datahub UI.

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
[https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default]

  was:
When DatahubSyncTool updates an entity in Datahub using an UPSERT request of
Datahub's RestEmitter client, it can be assumed that the entity is no longer
considered deleted, and needs to be discoverable henceforth in the Datahub UI.

For that, it is necessary to explicitly set the "status" metadata aspect of the
entity to "{'removed':false}". This handles the situation where the entity
may have been (soft) deleted in the past. Adding "removed:false" to the
"status" aspect has no impact on newly created entities, or on hard-deleted
entities (of which no trace remains anyway).

Ref: See sections on Soft Delete and Hard Delete in the Datahub docs: 
https://datahubproject.io/docs/how/delete-metadata/#soft-delete-the-default
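
The behaviour described above can be illustrated with a small in-memory
sketch. This is a toy model only, not the Datahub client API or the actual
DatahubSyncTool code: the ToyMetadataStore class, both sync functions, and the
example URN are hypothetical names introduced for illustration. It assumes
that an upsert writes a single aspect and leaves the entity's other aspects
untouched, which is why a stale removed:true survives a re-ingest unless the
"status" aspect is written explicitly.

```python
# Toy in-memory model of soft deletes. Hypothetical names throughout;
# this does not call or mirror the real Datahub client API.
class ToyMetadataStore:
    def __init__(self):
        self.aspects = {}  # urn -> {aspect name -> aspect value}

    def upsert(self, urn, aspect_name, value):
        """Write one aspect, leaving the entity's other aspects untouched."""
        self.aspects.setdefault(urn, {})[aspect_name] = value

    def soft_delete(self, urn):
        """Soft delete: keep the entity but mark its status as removed."""
        self.upsert(urn, "status", {"removed": True})

    def is_visible(self, urn):
        """Visible only if the entity exists and is not marked removed."""
        if urn not in self.aspects:
            return False
        return not self.aspects[urn].get("status", {}).get("removed", False)


def sync_without_status(store, urn, properties):
    """Models the reported problem: only the properties are upserted."""
    store.upsert(urn, "datasetProperties", properties)


def sync_with_status(store, urn, properties):
    """Models the proposed fix: also reset the status aspect on every sync."""
    store.upsert(urn, "datasetProperties", properties)
    store.upsert(urn, "status", {"removed": False})
```

In this model, re-syncing a soft-deleted entity with sync_without_status
leaves is_visible() returning False, while sync_with_status makes the entity
visible again. For an entity that was never deleted, both functions give the
same visible result.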

        Summary: DatahubSyncTool does not correctly re-ingest soft-deleted 
entities  (was: DatahubSyncTool should set "removed" status of an entity to 
false when updating it)

> DatahubSyncTool does not correctly re-ingest soft-deleted entities
> ------------------------------------------------------------------
>
>                 Key: HUDI-4994
>                 URL: https://issues.apache.org/jira/browse/HUDI-4994
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: meta-sync
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available



