[
https://issues.apache.org/jira/browse/FALCON-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041866#comment-15041866
]
Balu Vellanki commented on FALCON-1644:
---------------------------------------
To further explain the scenarios: users rely on a retention policy to decide
which feed instances are kept. Users might want to retain instances based on
# Age limit : Any instance younger than N time units should be retained. This
is supported by Falcon today. For example, suppose a feed is valid on a cluster
from 2015-01-01T00:00Z to 2015-01-10T00:00Z, the retention age limit is 1 day,
and the feed frequency is hourly.
#* The instance 2015-01-01T01:00Z will be retained up to 2015-01-02T01:00Z.
This instance will be deleted by a retention job running after 2015-01-02T01:00Z.
#* The instance 2015-01-09T23:00Z should be retained up to 2015-01-10T23:00Z.
The user expects this instance to be deleted by a retention job running after
2015-01-10T23:00Z. Today, Falcon does not run any retention jobs beyond
2015-01-10T00:00Z, so the instance 2015-01-09T23:00Z will stay forever. This
is wrong behavior and should be fixed.
# Instance Count Limit : Always retain the last N instances. This is not
supported by Falcon today. Users may want to retain the last N instances of a
feed (no matter the age), so Falcon should add this feature to the retention
lifecycle.
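
The two policies above can be illustrated with a minimal Python sketch. This is not Falcon code: the helper names (hourly_instances, expired_by_age, retained_by_count) are hypothetical, and it assumes that the validity end is exclusive and that an instance becomes eligible for deletion only once it is strictly older than the age limit.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def hourly_instances(start, end):
    """Hourly instance times in [start, end); the end is assumed exclusive."""
    out = []
    t = start
    while t < end:
        out.append(t)
        t += timedelta(hours=1)
    return out


def expired_by_age(instances, now, age_limit):
    """Age limit policy: instances strictly older than age_limit at `now`."""
    return [t for t in instances if now - t > age_limit]


def retained_by_count(instances, n):
    """Instance count policy: the last n instances, regardless of age."""
    return sorted(instances)[-n:] if n > 0 else []


validity_start = datetime(2015, 1, 1, 0, 0, tzinfo=UTC)
validity_end = datetime(2015, 1, 10, 0, 0, tzinfo=UTC)
age_limit = timedelta(days=1)
instances = hourly_instances(validity_start, validity_end)

# Current behavior: the last retention run happens no later than validity end,
# so anything not yet expired at that point is never cleaned up.
deleted_today = expired_by_age(instances, validity_end, age_limit)
never_deleted = [t for t in instances if t not in deleted_today]
print(len(never_deleted))  # 24 (every instance of 2015-01-09)

# Expected behavior: keep running until validity end + age limit.
deleted_fixed = expired_by_age(instances, validity_end + age_limit, age_limit)
print(len(deleted_fixed) == len(instances))  # True

# Instance count policy with N = 3.
last_three = retained_by_count(instances, 3)
```

With these assumptions, the instance 2015-01-09T23:00Z is in never_deleted under the current behavior and is only picked up once the last run is moved to validity end plus the age limit.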
Hope this clears up the confusion.
> Retention : Some feed instances are never deleted by retention jobs.
> --------------------------------------------------------------------
>
> Key: FALCON-1644
> URL: https://issues.apache.org/jira/browse/FALCON-1644
> Project: Falcon
> Issue Type: Bug
> Components: retention
> Affects Versions: 0.8
> Reporter: Balu Vellanki
> Assignee: Balu Vellanki
> Fix For: 0.9
>
> Attachments: FALCON-1644.patch
>
>
> Here is a sample feed XML.
> {code}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <feed name="rawEmailFeed" description="Raw customer email feed"
>       xmlns="uri:falcon:feed:0.1">
>     <tags>externalSystem=USWestEmailServers</tags>
>     <groups>churnAnalysisDataPipeline</groups>
>     <frequency>hours(1)</frequency>
>     <timezone>UTC</timezone>
>     <late-arrival cut-off="hours(1)"/>
>     <clusters>
>         <cluster name="primaryCluster" type="source">
>             <validity start="2015-10-30T01:00Z" end="2015-10-30T10:00Z"/>
>             <retention limit="hours(10)" action="delete"/>
>         </cluster>
>     </clusters>
>     <locations>
>         <location type="data"
>                   path="/user/ambari-qa/falcon/demo/primary/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>         <location type="stats" path="/"/>
>         <location type="meta" path="/"/>
>     </locations>
>     <ACL owner="ambari-qa" group="users" permission="0x755"/>
>     <schema location="/none" provider="/none"/>
> </feed>
> {code}
> In the above example, the validity time is "the time interval when the feed
> is valid on this cluster". After the validity time ends, Falcon is not
> expected to perform any operations on the feed. The retention job for this
> feed will run from the validity start time up to the validity end time, and
> will delete any feed instances older than 10 hours. As a result, some feed
> instances will never be deleted. In the above example, feed instances between
> 2015-10-30T00:00Z and 2015-10-30T10:00Z will never be deleted.
> Ideally, the retention coordinator job should run from "validity start time"
> up to "validity end time + retention age limit" to ensure all instances are
> handled.
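
The proposed end time can be worked out for the sample feed with a short Python sketch. This is only an illustration under stated assumptions, not the change in the attached patch: parse_limit, parse_utc and retention_window are hypothetical helpers, and parse_limit handles only the minutes(N), hours(N) and days(N) forms.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_limit(expr):
    """Turn an expression such as 'hours(10)' into a timedelta.

    Only minutes/hours/days are handled in this sketch.
    """
    m = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr)
    if m is None:
        raise ValueError("unsupported limit expression: " + expr)
    unit, count = m.group(1), int(m.group(2))
    return timedelta(**{unit: count})


def parse_utc(ts):
    """Parse a timestamp like '2015-10-30T10:00Z' as a UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)


def retention_window(validity_start, validity_end, limit):
    """Proposed window: validity start up to validity end + retention limit."""
    return parse_utc(validity_start), parse_utc(validity_end) + parse_limit(limit)


# Values taken from the sample feed above.
start, end = retention_window("2015-10-30T01:00Z", "2015-10-30T10:00Z", "hours(10)")
print(end.strftime("%Y-%m-%dT%H:%MZ"))  # 2015-10-30T20:00Z
```

For the sample feed, the retention coordinator would therefore keep running until 2015-10-30T20:00Z instead of stopping at 2015-10-30T10:00Z.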