[
https://issues.apache.org/jira/browse/ATLAS-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sheetal Shah updated ATLAS-5317:
--------------------------------
Description:
h2. Problem Statement
Atlas exposes a purge API ({{PUT /api/atlas/admin/purge}}) to hard-delete
entities. The API accepts a batch of GUIDs but fails the entire request if any
single entity delete throws an exception. This all-or-nothing behavior blocks
large clean-up jobs.
Key issues:
* One corrupt, missing, or locked GUID causes the entire batch to roll back
with HTTP 500
* No structured failure reporting — bad GUIDs are only logged as {{WARN}};
callers cannot identify which GUIDs failed
* Audit entry stores all input GUIDs in a single row, which can exceed safe
size limits and cause transaction rollbacks
* REST purge and background {{PurgeService}} cron can run concurrently on the
same GUIDs, causing {{PermanentLockingException}}
* No input validation — non-GUID strings passed to the API cause unexpected
failures
----
h2. Requirements
# Resilient purge — Continue purging remaining entities on per-entity error.
Return {{failedEntities}} (guid, error code, message) alongside successfully
purged entities. Return HTTP 207 on partial success.
# Bounded audit — Write one audit entry per mini-batch instead of one
oversized entry per request.
# Fix purge logic — Process in mini-batches; retry on locking errors; isolate
corrupt/missing GUIDs as skippable instead of failing the whole batch. Validate
input and reject non-GUID strings with HTTP 400. Prevent concurrent REST +
scheduled purge conflicts.
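
The three requirements above can be sketched roughly as follows. This is a minimal illustration, not Atlas code: {{ResilientPurgeSketch}}, {{LockConflictException}}, {{FailedEntity}}, {{PurgeResult}}, the error codes, and the {{batchSize}} / {{maxRetries}} parameters are all hypothetical names. It assumes GUIDs are canonical UUID strings, that a request containing any non-GUID string is rejected as a whole (the caller would map the {{IllegalArgumentException}} to HTTP 400), and that HTTP 200 is returned when nothing fails. Transactions and the guard against concurrent REST + scheduled purges are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names are real Atlas classes.
class ResilientPurgeSketch {
    // Assumption: GUIDs are canonical 8-4-4-4-12 hex UUID strings.
    static final Pattern GUID = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    // Hypothetical stand-in for a retryable locking error.
    static class LockConflictException extends RuntimeException {
        LockConflictException(String message) { super(message); }
    }

    // One failedEntities item: guid, error code, message.
    static class FailedEntity {
        final String guid, errorCode, message;
        FailedEntity(String guid, String errorCode, String message) {
            this.guid = guid; this.errorCode = errorCode; this.message = message;
        }
    }

    static class PurgeResult {
        final List<String> purged = new ArrayList<>();
        final List<FailedEntity> failedEntities = new ArrayList<>();
        // One entry per mini-batch, holding only that batch's purged GUIDs.
        final List<List<String>> auditEntries = new ArrayList<>();

        // Assumption: the issue only specifies 207 for partial success; this
        // sketch returns 200 when nothing failed and 207 whenever anything did.
        int httpStatus() { return failedEntities.isEmpty() ? 200 : 207; }
    }

    static PurgeResult purge(List<String> guids, int batchSize, int maxRetries,
                             Consumer<String> deleter) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        // Validate everything up front so bad input never reaches the delete path.
        for (String guid : guids) {
            if (guid == null || !GUID.matcher(guid).matches()) {
                throw new IllegalArgumentException("Not a GUID: " + guid);
            }
        }
        PurgeResult result = new PurgeResult();
        for (int start = 0; start < guids.size(); start += batchSize) {
            List<String> batch = guids.subList(start, Math.min(start + batchSize, guids.size()));
            List<String> purgedInBatch = new ArrayList<>();
            for (String guid : batch) {
                try {
                    deleteWithRetry(guid, maxRetries, deleter);
                    purgedInBatch.add(guid);
                } catch (RuntimeException e) {
                    // Record the failure and keep going instead of failing the batch.
                    String code = e instanceof LockConflictException ? "LOCK_CONFLICT" : "PURGE_FAILED";
                    result.failedEntities.add(new FailedEntity(guid, code, String.valueOf(e.getMessage())));
                }
            }
            result.purged.addAll(purgedInBatch);
            result.auditEntries.add(purgedInBatch);
        }
        return result;
    }

    // Retries only locking errors; any other exception propagates immediately.
    static void deleteWithRetry(String guid, int maxRetries, Consumer<String> deleter) {
        for (int attempt = 0; ; attempt++) {
            try {
                deleter.accept(guid);
                return;
            } catch (LockConflictException e) {
                if (attempt >= maxRetries) throw e;
            }
        }
    }

    public static void main(String[] args) {
        List<String> guids = Arrays.asList(
            "11111111-1111-1111-1111-111111111111",
            "22222222-2222-2222-2222-222222222222",
            "33333333-3333-3333-3333-333333333333");
        // Fake deleter: the second GUID simulates a corrupt entity.
        PurgeResult r = purge(guids, 2, 2, guid -> {
            if (guid.startsWith("2")) throw new IllegalStateException("corrupt vertex");
        });
        System.out.println("status=" + r.httpStatus() + " purged=" + r.purged.size()
            + " failed=" + r.failedEntities.size() + " auditEntries=" + r.auditEntries.size());
    }
}
```

Passing the delete operation in as a {{Consumer}} keeps the sketch independent of any graph or store API; a real implementation would presumably wrap each mini-batch in its own transaction and write the audit entry inside it.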
was:
h2. Problem Statement
Atlas exposes a purge API ({{PUT /api/atlas/admin/purge}}) to hard-delete
entities. The API accepts a batch of GUIDs but fails the entire request if any
single entity delete throws an exception. This all-or-nothing behavior blocks
large clean-up jobs.
Key issues:
* One corrupt, missing, or locked GUID causes the entire batch to roll back
with HTTP 500
* No structured failure reporting — bad GUIDs are only logged as {{WARN}};
callers cannot identify which GUIDs failed
* Audit entry stores all input GUIDs in a single row, which can exceed safe
size limits and cause transaction rollbacks
* REST purge and background {{PurgeService}} cron can run concurrently on the
same GUIDs, causing {{PermanentLockingException}}
* No input validation — non-GUID strings passed to the API cause unexpected
failures
----
h2. Requirements
# Resilient purge — Continue purging remaining entities on per-entity error.
Return {{failedEntities}} (guid, error code, message) alongside successfully
purged entities. Return HTTP 207 on partial success.
# Bounded audit — Write one audit entry per mini-batch instead of one
oversized entry per request.
# Fix purge logic — Process in mini-batches (default 50 GUIDs per
transaction); retry on locking errors; isolate corrupt/missing GUIDs as
skippable instead of failing the whole batch. Validate input and reject
non-GUID strings with HTTP 400. Prevent concurrent REST + scheduled purge
conflicts.
> Make Atlas Purge API more resilient
> -----------------------------------
>
> Key: ATLAS-5317
> URL: https://issues.apache.org/jira/browse/ATLAS-5317
> Project: Atlas
> Issue Type: Bug
> Components: atlas-core
> Reporter: Sheetal Shah
> Assignee: Sheetal Shah
> Priority: Major