Hi Aman Kumar,
To permanently delete records from Atlas, you can use the REST API call PUT
/admin/purge.
Note that this API purges only soft-deleted entities, so before calling it,
first delete the required entities using either DELETE /v2/entity/bulk?guid={guid}
or DELETE /v2/entity/guid/{guid}.
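As a rough illustration, the two-step flow (soft delete, then purge) could be sketched as below. The base URL is an assumption for a default local deployment, and authentication is omitted; the helpers only build the request URL and body, which you would then send with your preferred HTTP client:

```python
import json

# Assumed base URL for a local Atlas instance; adjust for your deployment.
ATLAS_BASE = "http://localhost:21000/api/atlas"

def soft_delete_url(guids):
    # Step 1: DELETE /v2/entity/bulk?guid=...&guid=... soft-deletes the entities.
    return ATLAS_BASE + "/v2/entity/bulk?" + "&".join("guid=" + g for g in guids)

def purge_request(guids):
    # Step 2: PUT /admin/purge takes a JSON array of GUIDs of the
    # soft-deleted entities to remove them permanently.
    return ATLAS_BASE + "/admin/purge", json.dumps(list(guids))

url, body = purge_request(["c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64"])
```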
The following points should be noted when you purge a deleted entity:
- The entity is removed from Atlas.
- Related, dependent entities are also removed. For example, when purging a
deleted Hive table, the deleted entities for the table columns, DDL, and
storage description are also purged.
- The entity is no longer available in search results, even with Show
historical entities enabled.
- Lineage relationships that include the purged entities are removed, which
breaks lineages that depend upon a purged entity to show connections
between ancestors and descendants.
- Classifications propagated across the purged entities are removed from all
descendant entities.
- Classifications assigned to the purged entities and set to propagate are
removed from all descendant entities.
Regards,
Pinal Shah
On Fri, Sep 20, 2024 at 5:53 PM Aman Kumar <[email protected]> wrote:
> Hi Team,
>
> I'm using apache-atlas-2.3.0 with embedded hbase and solr.
>
> Problem Statement:
>
> - Row count in the 'apache_atlas_janus' table is increasing exponentially
> whenever a new typedef or entity is created/updated. While this is
> not an issue with the embedded hbase and solr distribution, in production
> we are reaching around 20-30M rows in the janus table
>
> - While apache-atlas provides good control over setting up the TTL on the
> 'audit table' (
> https://issues.apache.org/jira/browse/ATLAS-4768
> ), there's no way to control TTL on the 'janus table', other than to
> set up TTL on the column families manually using the hbase shell.
>
> - Records on 'janus table' are not human-readable, since they
> are stored in a serialized form.
>
> - Setting up TTL manually on the janus table's column families causes
> atlas to malfunction, evidently because we don't know which rows
> are getting deleted.
>
>
>
> What is required:
>
> -> Some level of control over the
> janus table TTL, so as to purge older/unneeded records
> without breaking any other components in atlas.
>
> -> If TTL is not possible, then at least there should be a way to
> deserialize the hbase rows in
> the janus table, so that we can implement our own TTL logic.
>
>
>
> What I've tried:
>
> -> Reading the janus hbase table through the java code, ran into
> this issue:
> https://github.com/JanusGraph/janusgraph/issues/941
>
> -> Tried setting up TTL on the vertices in janusgraph using
> gremlin queries. The problem is, each vertex in atlas is defined with
> the label 'vertex'.
> Setting up a management object on that label throws the
> error: 'Name cannot be in protected namespace: vertex'
>
> -> Even tried setting up TTL on a vertex on a local janusgraph
> instance (without atlas). Didn't see any difference in row count even
> after the vertex TTL expired
>
> -> Finally, tried to delete some rows in the janus table based on a
> timestamp range, for the following scenarios:
>
> - Tried deleting the rows in janus table only for a single
> update timestamp
>
> - Tried deleting the rows only for entity updates timestamp
>
> - Tried deleting the rows which were created before the latest
> entity update
>
>
> In all the cases, the entity disappeared from the UI, with the
> following error:
> No typename found for given entity with guid:
> c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64
>
>
> In short, deleting any rows related to an entity
> in the janus table breaks the entity itself.
>
>
> Please let me know if there's any existing solution for the above
> problem, or should I reach out to
> the janusgraph community regarding the serialization/deserialization issue.
>
> Thanks!
>