On Mon, Sep 23, 2024 at 11:29 AM pinal shah <[email protected]> wrote:
> Hi Aman Kumar,
>
> To permanently delete records from Atlas, you can use the REST API call
> PUT /admin/purge.
> This API purges only soft-deleted entities, so before calling it, delete
> the required entities using either DELETE v2/entity/bulk?guid={guid} or
> DELETE v2/entity/guid/{guid}.
>
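
The two-step flow described above (soft delete first, then purge) can be
sketched roughly as follows. This is a minimal illustration, not a tested
client: the base URL http://localhost:21000/api/atlas, the JSON-array request
body for the purge call, and the helper names are assumptions, and
auth_header is whatever Authorization value your deployment expects.

```python
import json
import urllib.parse
import urllib.request

# Assumption: default standalone Atlas address; adjust for your deployment.
ATLAS_BASE = "http://localhost:21000/api/atlas"


def build_soft_delete_request(guids, base=ATLAS_BASE):
    """Build the bulk soft-delete call: DELETE v2/entity/bulk?guid=...&guid=..."""
    query = urllib.parse.urlencode([("guid", g) for g in guids])
    return urllib.request.Request(f"{base}/v2/entity/bulk?{query}", method="DELETE")


def build_purge_request(guids, base=ATLAS_BASE):
    """Build the purge call: PUT admin/purge with a JSON array of GUIDs (assumed body shape)."""
    body = json.dumps(list(guids)).encode("utf-8")
    return urllib.request.Request(
        f"{base}/admin/purge",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def delete_then_purge(guids, auth_header, base=ATLAS_BASE):
    """Soft-delete the entities, then purge them. Performs real HTTP calls."""
    for req in (build_soft_delete_request(guids, base), build_purge_request(guids, base)):
        req.add_header("Authorization", auth_header)
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

The request builders are kept separate from the function that sends them so
the URLs and bodies can be inspected before anything is actually deleted.
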
> Note the following when you purge a deleted entity:
> - The entity is removed from Atlas.
> - Related, dependent entities are also removed. For example, when purging
> a deleted Hive table, the deleted entities for the table columns, DDL, and
> storage description are also purged.
> - The entity is no longer available in search results, even with Show
> historical entities enabled.
> - Lineage relationships that include the purged entities are removed,
> which breaks lineages that depend upon a purged entity to show connections
> between ancestors and descendants.
> - Classifications propagated across the purged entities are removed in all
> descendant entities.
> - Classifications assigned to the purged entities and set to propagate are
> removed from all descendant entities.
>
> Regards,
> Pinal Shah
>
> On Fri, Sep 20, 2024 at 5:53 PM Aman Kumar <[email protected]> wrote:
>
>> Hi Team,
>>
>> I'm using apache-atlas-2.3.0 with embedded hbase and solr.
>>
>> Problem Statement:
>>
>> - Row count in 'apache_atlas_janus' table is increasing exponentially
>> whenever a new typedef or entity is created/updated. While this is
>> not an issue with the embedded hbase and solr distribution, in production
>> we are reaching a row count of around 20-30M in the janus table.
>>
>> - While apache-atlas provides control over the TTL on the
>> 'audit table' (
>> https://issues.apache.org/jira/browse/ATLAS-4768
>> ), there's no way to control TTL on the 'janus table', other than to
>> set up TTL on the column families manually using hbase shell.
>>
>> - Records on 'janus table' are not human-readable, since they
>> are stored in a serialized form.
>>
>> - Setting up TTL manually on the janus table's column families causes
>> atlas to malfunction, presumably because we can't tell which rows
>> are getting deleted.
>>
>>
>>
>> What is required:
>>
>> -> Some level of control over the
>> janus table TTL, so as to purge older/unneeded records
>> without breaking any other components in atlas.
>>
>> -> If TTL is not possible, then at least there should be a way to
>> deserialize the hbase rows in
>> the janus table, so that we can implement our own TTL logic.
>>
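
One way to get TTL-like behaviour without touching the serialized hbase rows
would be to apply the retention rule at the entity level and hand the
resulting GUIDs to the purge API. The sketch below only shows the selection
step. It assumes each entity is a dict in the shape Atlas returns as entity
JSON, with "guid", "status" and "updateTime" (epoch milliseconds);
select_purge_candidates and ttl_days are hypothetical names, not Atlas
settings.

```python
import time


def select_purge_candidates(entities, ttl_days, now_ms=None):
    """Return GUIDs of soft-deleted entities last updated more than ttl_days ago."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    cutoff_ms = now_ms - ttl_days * 24 * 60 * 60 * 1000
    return [
        e["guid"]
        for e in entities
        # Only soft-deleted entities qualify, since purge acts on those alone.
        # Entities without an updateTime are treated as fresh and kept.
        if e.get("status") == "DELETED" and e.get("updateTime", now_ms) < cutoff_ms
    ]
```

Active entities are never selected, so this does not age out live metadata;
anything that should expire has to be soft-deleted first.
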
>>
>>
>> What I've tried:
>>
>> -> Reading the janus hbase table through java code; ran into
>> this issue:
>> https://github.com/JanusGraph/janusgraph/issues/941
>>
>> -> Tried setting up the TTL on the vertices in janusgraph using
>> gremlin queries. The problem is, each vertex in atlas is defined with
>> the label 'vertex'.
>> Setting up the management object on the label itself throws the
>> error: 'Name cannot be in protected namespace: vertex'
>>
>> -> Even tried setting up TTL on a vertex on a local janusgraph
>> instance (without atlas). Didn't see any difference in row count even
>> after the vertex TTL expired.
>>
>> -> Finally, tried to delete some rows in the janus table based on a
>> timestamp range, for the following scenarios:
>>
>> - Tried deleting the rows in janus table only for a single
>> update timestamp
>>
>> - Tried deleting the rows only for entity update timestamps
>>
>> - Tried deleting the rows which were created before the latest
>> entity update
>>
>>
>> In all the cases, the entity disappeared from the UI, with the
>> following error:
>> No typename found for given entity with guid:
>> c90744dc-7ac6-4b5f-8fd2-ffc6282f5a64
>>
>>
>> In short, deleting any rows related to the entity
>> in the janus table breaks the entity itself.
>>
>>
>> Please let me know if there's any existing solution for the above
>> problem, or whether I should reach out to
>> the janusgraph community regarding the serialization/deserialization
>> issue.
>>
>> Thanks!
>>
>