[jira] [Updated] (ATLAS-5317) Make Atlas Purge API more resilient

Sheetal Shah (Jira) Fri, 26 Jun 2026 01:22:14 -0700


     [ 
https://issues.apache.org/jira/browse/ATLAS-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sheetal Shah updated ATLAS-5317:
--------------------------------
    Description: 
h2. Problem Statement

Atlas exposes a purge API ({{PUT /api/atlas/admin/purge}}) to hard-delete 
entities. The API accepts a batch of GUIDs but fails the entire request if any 
single entity delete throws an exception. This all-or-nothing behavior blocks 
large clean-up jobs.

Key issues:
 * One corrupt, missing, or locked GUID causes the entire batch to roll back 
with HTTP 500
 * No structured failure reporting — bad GUIDs are only logged as {{WARN}}; 
callers cannot identify which GUIDs failed
 * Audit entry stores all input GUIDs in a single row, which can exceed safe 
size limits and cause transaction rollbacks
 * REST purge and background {{PurgeService}} cron can run concurrently on the 
same GUIDs, causing {{PermanentLockingException}}
 * No input validation — non-GUID strings passed to the API cause unexpected 
failures

----
h2. Requirements
 # Resilient purge — Continue purging remaining entities on per-entity error. 
Return {{failedEntities}} (guid, error code, message) alongside successfully 
purged entities. Return HTTP 207 on partial success.
 # Bounded audit — Write one audit entry per mini-batch instead of one 
oversized entry per request.
 # Fix purge logic — Process in mini-batches; retry on locking errors; isolate 
corrupt/missing GUIDs as skippable instead of failing the whole batch. Validate 
input and reject non-GUID strings with HTTP 400. Prevent concurrent REST + 
scheduled purge conflicts.
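
Requirements 1 and 2 could look roughly like the sketch below. It is not Atlas code: {{purge_guids}}, {{delete_entity}}, {{write_audit}}, {{PurgeResult}} and the keys of each failure record are hypothetical names chosen for illustration, and the status returned when every GUID fails is an assumption, since the description only specifies HTTP 207 for partial success.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Sequence


@dataclass
class PurgeResult:
    """Outcome of one purge request (hypothetical shape)."""
    purged: List[str] = field(default_factory=list)
    failed_entities: List[dict] = field(default_factory=list)

    @property
    def http_status(self) -> int:
        if not self.failed_entities:
            return 200
        # 207 on partial success is from the requirements; 500 when
        # nothing was purged is an assumption made for this sketch.
        return 207 if self.purged else 500


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def purge_guids(
    guids: Sequence[str],
    delete_entity: Callable[[str], None],
    write_audit: Callable[[List[str]], None],
    batch_size: int,
) -> PurgeResult:
    """Purge GUIDs in mini-batches, recording per-entity failures
    instead of aborting the whole request."""
    result = PurgeResult()
    for batch in chunked(list(guids), batch_size):
        purged_in_batch: List[str] = []
        for guid in batch:
            try:
                delete_entity(guid)
                purged_in_batch.append(guid)
            except Exception as exc:  # isolate the failing GUID and continue
                result.failed_entities.append({
                    "guid": guid,
                    "errorCode": type(exc).__name__,
                    "message": str(exc),
                })
        if purged_in_batch:
            # One bounded audit entry per mini-batch, not one per request.
            write_audit(purged_in_batch)
        result.purged.extend(purged_in_batch)
    return result
```

A caller would pass the real delete and audit operations as the two callables. Keeping them as parameters makes the failure-isolation logic testable without a graph store.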

  was:
h2. Problem Statement

Atlas exposes a purge API ({{PUT /api/atlas/admin/purge}}) to hard-delete 
entities. The API accepts a batch of GUIDs but fails the entire request if any 
single entity delete throws an exception. This all-or-nothing behavior blocks 
large clean-up jobs.

Key issues:
 * One corrupt, missing, or locked GUID causes the entire batch to roll back 
with HTTP 500
 * No structured failure reporting — bad GUIDs are only logged as {{WARN}}; 
callers cannot identify which GUIDs failed
 * Audit entry stores all input GUIDs in a single row, which can exceed safe 
size limits and cause transaction rollbacks
 * REST purge and background {{PurgeService}} cron can run concurrently on the 
same GUIDs, causing {{PermanentLockingException}}
 * No input validation — non-GUID strings passed to the API cause unexpected 
failures

----
h2. Requirements
 # Resilient purge — Continue purging remaining entities on per-entity error. 
Return {{failedEntities}} (guid, error code, message) alongside successfully 
purged entities. Return HTTP 207 on partial success.
 # Bounded audit — Write one audit entry per mini-batch instead of one 
oversized entry per request.
 # Fix purge logic — Process in mini-batches (default 50 GUIDs per 
transaction); retry on locking errors; isolate corrupt/missing GUIDs as 
skippable instead of failing the whole batch. Validate input and reject 
non-GUID strings with HTTP 400. Prevent concurrent REST + scheduled purge 
conflicts.


> Make Atlas Purge API more resilient
> -----------------------------------
>
>                 Key: ATLAS-5317
>                 URL: https://issues.apache.org/jira/browse/ATLAS-5317
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>            Reporter: Sheetal Shah
>            Assignee: Sheetal Shah
>            Priority: Major
>
> h2. Problem Statement
> Atlas exposes a purge API ({{PUT /api/atlas/admin/purge}}) to hard-delete 
> entities. The API accepts a batch of GUIDs but fails the entire request if 
> any single entity delete throws an exception. This all-or-nothing behavior 
> blocks large clean-up jobs.
> Key issues:
>  * One corrupt, missing, or locked GUID causes the entire batch to roll back 
> with HTTP 500
>  * No structured failure reporting — bad GUIDs are only logged as 
> {{WARN}}; callers cannot identify which GUIDs failed
>  * Audit entry stores all input GUIDs in a single row, which can exceed safe 
> size limits and cause transaction rollbacks
>  * REST purge and background {{PurgeService}} cron can run concurrently on 
> the same GUIDs, causing {{PermanentLockingException}}
>  * No input validation — non-GUID strings passed to the API cause unexpected 
> failures
> ----
> h2. Requirements
>  # Resilient purge — Continue purging remaining entities on per-entity error. 
> Return {{failedEntities}} (guid, error code, message) alongside successfully 
> purged entities. Return HTTP 207 on partial success.
>  # Bounded audit — Write one audit entry per mini-batch instead of one 
> oversized entry per request.
>  # Fix purge logic — Process in mini-batches; retry on locking errors; 
> isolate corrupt/missing GUIDs as skippable instead of failing the whole 
> batch. Validate input and reject non-GUID strings with HTTP 400. Prevent 
> concurrent REST + scheduled purge conflicts.
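
Two parts of the third requirement, input validation and retry on locking errors, can be illustrated with a minimal sketch. It assumes GUIDs arrive in canonical UUID text form. {{validate_guids}}, {{with_lock_retry}} and {{LockingError}} are hypothetical names, with {{LockingError}} standing in for a storage-level failure such as {{PermanentLockingException}}. The retry count and backoff values are placeholders, not proposed defaults.

```python
import re
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

# Assumption: a valid GUID is a canonical 8-4-4-4-12 hexadecimal UUID string.
GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


class LockingError(Exception):
    """Stand-in for a transient storage locking failure."""


def validate_guids(guids: Sequence[object]) -> List[object]:
    """Return the malformed inputs. A non-empty result means the
    request should be rejected with HTTP 400 before any delete runs."""
    return [
        g for g in guids
        if not isinstance(g, str) or GUID_PATTERN.fullmatch(g) is None
    ]


def with_lock_retry(
    action: Callable[[], T],
    attempts: int = 3,
    backoff_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying only on LockingError with linear backoff.
    The last failure is re-raised so the caller can report it."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except LockingError:
            if attempt == attempts:
                raise
            sleep(backoff_seconds * attempt)
    raise ValueError("attempts must be at least 1")
```

Retrying only on the locking error keeps corrupt or missing GUIDs on the fast path to the failure list instead of repeating a delete that cannot succeed.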



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
