[ 
https://issues.apache.org/jira/browse/FALCON-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095846#comment-14095846
 ] 

Sowmya Ramesh commented on FALCON-594:
--------------------------------------

Several approaches have been identified for recording lineage information when 
an eviction policy runs.

*Approach 1:*

When the eviction policy runs, delete the identified feed instance vertices 
from the graph. For completeness, the vertices of the associated entities 
should also be deleted, i.e. a cascade delete.

Pros:
- Because the identified feed instance vertices are deleted, the graph DB does 
not keep growing, so storage space is not a concern.

Cons:
- Eviction history is not preserved, so this information cannot be retrieved 
later.
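
A minimal sketch of Approach 1, using a toy in-memory graph rather than 
Falcon's actual graph store. The `Graph` class, the vertex ids, and the edge 
labels are hypothetical. "Cascade" is interpreted here as also removing any 
neighbouring vertex that is left without edges; that interpretation is an 
assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy property graph for illustration only (not Falcon's graph API)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (source id, label, target id)

    def neighbours(self, vid):
        # Every vertex that shares an edge with vid, in either direction.
        return {d if s == vid else s for s, _, d in self.edges if vid in (s, d)}

    def remove_vertex(self, vid):
        # Removing a vertex also removes every edge that touches it.
        self.vertices.pop(vid, None)
        self.edges = [e for e in self.edges if vid not in (e[0], e[2])]


def evict_with_cascade(graph, instance_ids):
    """Delete evicted instances, then any neighbour left with no edges."""
    for vid in instance_ids:
        neighbours = graph.neighbours(vid)
        graph.remove_vertex(vid)
        for other in neighbours:
            if not graph.neighbours(other):  # assumption: orphaned means removable
                graph.remove_vertex(other)


g = Graph()
for vid in ("clicks", "clicks/2014-08-01", "clicks/2014-08-02", "cleanse/2014-08-01"):
    g.vertices[vid] = {}
g.edges += [
    ("clicks/2014-08-01", "instance-of", "clicks"),
    ("clicks/2014-08-02", "instance-of", "clicks"),
    ("cleanse/2014-08-01", "output", "clicks/2014-08-01"),
]

evict_with_cascade(g, ["clicks/2014-08-01"])
```

After the call only the feed entity and the surviving instance remain, which 
is why no eviction history can be recovered from the graph.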

*Approach 2:*

- When the eviction policy runs, delete the identified feed instance vertices 
[cascade delete].
- For each affected feed entity vertex, add an edge labeled "evicted" to a 
common Evicted vertex. Add a property that identifies the evicted feed instance 
vertex [fi], the timestamp of eviction [ti], and the WF id [wi]. Instead of 
creating a new common vertex, a self-loop can be added.

Pros:
- Because the identified feed instance vertices are deleted, the graph DB does 
not keep growing, so storage space is not a concern.
- Some eviction details are stored in the graph DB, so they can be retrieved 
later.

Cons:
- Requires more storage than Approach 1, since some eviction details are kept.
- A property [fi, ti, wi] is added for each evicted instance. Retrieving the 
eviction details requires parsing this property, which leads to performance 
issues.
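
A rough sketch of Approach 2, again on a toy in-memory graph and not Falcon's 
real API. The `|`-separated encoding of [fi, ti, wi], the one-edge-per-eviction 
layout, and all ids are assumptions made for illustration. It shows where the 
parsing cost mentioned above comes from.

```python
from dataclasses import dataclass, field

EVICTED = "Evicted"  # hypothetical id of the common vertex


@dataclass
class Graph:
    """Toy property graph for illustration only."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (source, label, target, props)

    def remove_vertex(self, vid):
        self.vertices.pop(vid, None)
        self.edges = [e for e in self.edges if vid not in (e[0], e[2])]


def evict_and_record(graph, feed_id, instance_id, evicted_at, workflow_id):
    """Delete the instance vertex but keep [fi, ti, wi] on an "evicted" edge."""
    graph.remove_vertex(instance_id)
    graph.vertices.setdefault(EVICTED, {"type": "evicted"})
    # Assumed encoding: the three values joined with "|" in a single property.
    details = "|".join([instance_id, evicted_at, workflow_id])
    graph.edges.append((feed_id, "evicted", EVICTED, {"details": details}))


def eviction_history(graph, feed_id):
    """Every lookup has to parse the encoded property again."""
    history = []
    for src, label, _dst, props in graph.edges:
        if src == feed_id and label == "evicted":
            fi, ti, wi = props["details"].split("|")
            history.append({"instance": fi, "evicted_at": ti, "workflow": wi})
    return history


g = Graph()
g.vertices["clicks"] = {"type": "feed-entity"}
g.vertices["clicks/2014-08-01"] = {"type": "feed-instance"}
g.edges.append(("clicks/2014-08-01", "instance-of", "clicks", {}))

evict_and_record(g, "clicks", "clicks/2014-08-01", "2014-08-13T10:00Z", "wf-0001")
history = eviction_history(g, "clicks")
```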

*Approach 3:*
Create a common Evicted vertex and, when the eviction policy runs, add an edge 
labeled "evicted" from each identified feed instance vertex to it.

Pros:
- Simple to implement
- Retains all the details of evicted feed instances for historical queries

Cons:
- Storage and performance issues, since the graph DB keeps growing
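
As an illustration of Approach 3, the sketch below keeps every vertex and only 
adds an "evicted" edge. The adjacency-list layout, the `instance-of` label, and 
the ids are hypothetical and stand in for whatever the real graph store 
provides.

```python
from collections import defaultdict

EVICTED = "Evicted"  # hypothetical id of the common vertex

vertices = {EVICTED: {"type": "evicted"}}
out_edges = defaultdict(list)  # source vertex id -> [(label, target vertex id)]


def add_instance(instance_id, feed_id):
    vertices.setdefault(feed_id, {"type": "feed-entity"})
    vertices[instance_id] = {"type": "feed-instance"}
    out_edges[instance_id].append(("instance-of", feed_id))


def mark_evicted(instance_id):
    """Keep the vertex and link it to the common Evicted vertex."""
    out_edges[instance_id].append(("evicted", EVICTED))


def evicted_instances(feed_id):
    """Historical query: evicted instances of one feed, vertices still intact."""
    return sorted(
        vid for vid, edges in out_edges.items()
        if ("instance-of", feed_id) in edges and ("evicted", EVICTED) in edges
    )


add_instance("clicks/2014-08-01", "clicks")
add_instance("clicks/2014-08-02", "clicks")
mark_evicted("clicks/2014-08-01")
```

Nothing is ever removed here, so the vertex count only goes up, which is the 
storage concern listed under Cons.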

*Approach 4:*
When the retention policy runs, add an "evicted" property to each identified 
feed instance vertex. To keep the graph DB from growing and causing 
storage/performance issues, do a cleanup based on a time limit, which ought to 
be available 
[FALCON-335|https://issues.apache.org/jira/browse/FALCON-335].

Pros:
- Retains all the details of evicted feed instances for historical queries

Cons:
- Storage and performance issues, since the graph DB keeps growing
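
A small sketch of the marking step in Approach 4, with vertices modelled as a 
plain dict. Storing the eviction time (epoch seconds) as the value of the 
"evicted" property is an assumption; the comment above only says that an 
"evicted" property is added.

```python
def mark_evicted(vertices, instance_ids, evicted_at):
    """Flag instances as evicted without touching the rest of the graph."""
    for vid in instance_ids:
        # Assumption: the property value is the eviction time in epoch seconds.
        vertices[vid]["evicted"] = evicted_at


def live_instances(vertices):
    """Feed instances that have not been evicted."""
    return sorted(
        vid for vid, props in vertices.items()
        if props.get("type") == "feed-instance" and "evicted" not in props
    )


vertices = {
    "clicks": {"type": "feed-entity"},
    "clicks/2014-08-01": {"type": "feed-instance"},
    "clicks/2014-08-02": {"type": "feed-instance"},
}

mark_evicted(vertices, ["clicks/2014-08-01"], evicted_at=1407924000)
remaining = live_instances(vertices)
```

Queries that should ignore evicted data filter on the property, while 
historical queries can still read the full vertex.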

In addition, the decision to purge the vertices can be based on user input on 
whether to preserve the history. In this case, multiple approaches have to be 
implemented. 
Instead of deleting vertices right away, there can be a time limit after which 
the DB cleanup is done.
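
One possible shape for such a cleanup, sketched on a plain dict of vertices. 
The `preserve_history` flag, the `max_age_seconds` parameter, and the epoch 
value stored in the "evicted" property are all hypothetical names and 
conventions, not existing Falcon settings.

```python
DAY = 86400  # seconds


def purge_evicted(vertices, now, max_age_seconds, preserve_history=False):
    """Remove vertices evicted at least max_age_seconds ago; return their ids."""
    if preserve_history:
        return []  # user asked to keep the history, so nothing is purged
    expired = [
        vid for vid, props in vertices.items()
        if "evicted" in props and now - props["evicted"] >= max_age_seconds
    ]
    for vid in expired:
        del vertices[vid]
    return sorted(expired)


vertices = {
    "clicks/2014-08-01": {"evicted": 0},        # evicted 10 days before "now"
    "clicks/2014-08-05": {"evicted": 5 * DAY},  # evicted 5 days before "now"
    "clicks/2014-08-09": {},                    # not evicted
}

purged = purge_evicted(vertices, now=10 * DAY, max_age_seconds=7 * DAY)
```

With a 7-day limit only the first vertex is removed. Passing 
`preserve_history=True` leaves the graph untouched regardless of age.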

Approach 4 has been identified as a feasible solution. Please comment if you 
have any concerns or inputs.

Thanks!



> Process lineage information for Retention policies
> --------------------------------------------------
>
>                 Key: FALCON-594
>                 URL: https://issues.apache.org/jira/browse/FALCON-594
>             Project: Falcon
>          Issue Type: Sub-task
>            Reporter: Sowmya Ramesh
>            Assignee: Sowmya Ramesh
>
> Falcon currently addresses process executions and not data lifecycle 
> policies. This task should address adding this information.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
