[ 
https://issues.apache.org/jira/browse/FALCON-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336792#comment-15336792
 ] 

Venkatesan Ramachandran commented on FALCON-2030:
-------------------------------------------------


[~ajayyadava]

Let's assume that feed A is pointing to a dir /basedir/feedA 
Metadata gets exported and stored as a file (for simplicity we assume 1 file) 
in the dir as /basedir/feedA/datafile-t1

The consumers (Pig or MR jobs) take feed A as input and read from the dir 
/basedir/feedA/, and so read the file /basedir/feedA/datafile-t1

After some days, the metadata changes and a new export happens that produces a 
new file under the feed dir as /basedir/feedA/datafile-t2

Now there are two files under the feed dir, one with slightly older data and 
the other with updated data, as below:
/basedir/feedA/datafile-t1
/basedir/feedA/datafile-t2

Let's assume that the customer has implemented a custom retention that retires 
all the files except the last one (and the retention job runs once a day)

At this point, 

a) the workflow (Pig/MR etc.) will consume both files (duplicate data).
    If I read your comment above correctly, you are suggesting consuming only 
the latest file. 
    This would require developing custom Pig loaders, input formats etc., 
which is uncommon and error-prone.

b) In the absence of (a), when the workflow consumes both files under the 
feed dir and the retention deletes the older one in the meantime, the Pig or 
MR task will try to read the deleted file and fail.
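
To make (b) concrete, here is a minimal sketch of the race, assuming a local 
temp dir in place of HDFS. The helper names (list_inputs, retain_latest, 
run_demo) are hypothetical and are not Falcon, Pig or MR APIs; they only mimic 
a job that resolves its inputs up front and a retention that runs before the 
tasks read.

```python
import os
import tempfile


def list_inputs(feed_dir):
    """Snapshot the input files, the way a job resolves its inputs up front."""
    return sorted(os.path.join(feed_dir, n) for n in os.listdir(feed_dir))


def retain_latest(feed_dir):
    """Hypothetical custom retention: delete every file except the last name."""
    names = sorted(os.listdir(feed_dir))
    for name in names[:-1]:
        os.remove(os.path.join(feed_dir, name))


def run_demo():
    """Return (planned file names, names that failed to open)."""
    feed_dir = tempfile.mkdtemp(prefix="feedA-")
    for name in ("datafile-t1", "datafile-t2"):
        with open(os.path.join(feed_dir, name), "w") as f:
            f.write(name + "\n")

    planned = list_inputs(feed_dir)  # the job sees both files
    retain_latest(feed_dir)          # retention runs before the tasks read

    failed = []
    for path in planned:
        try:
            with open(path) as f:
                f.read()
        except FileNotFoundError:
            failed.append(os.path.basename(path))
    return [os.path.basename(p) for p in planned], failed


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the job planned to read both files, but the older one is gone 
by the time it is opened, which is the failure described in (b).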

It is better to write the files under a <version or pattern> subdir and apply 
custom retention (based on access time etc) to retire that dir. 
The workflow can easily use the LATEST EL to safely access the latest <pattern 
dir>. This seems to be a more plausible use-case IMO.
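
A rough sketch of that layout, again on a local temp dir and with 
hypothetical helper names: each export goes into its own version subdir, the 
consumer resolves only the newest subdir, and retention retires whole subdirs. 
This only illustrates the idea; it is not how Falcon's LATEST EL or retention 
is implemented, and it assumes version names that sort lexicographically.

```python
import os
import shutil
import tempfile


def version_dirs(feed_dir):
    """Return version subdir names in sorted (oldest to newest) order."""
    return sorted(
        d for d in os.listdir(feed_dir)
        if os.path.isdir(os.path.join(feed_dir, d))
    )


def latest_version_dir(feed_dir):
    """Stand-in for resolving the latest pattern dir; None if there is none."""
    versions = version_dirs(feed_dir)
    return os.path.join(feed_dir, versions[-1]) if versions else None


def retire_old_versions(feed_dir, keep=1):
    """Hypothetical retention: remove all but the newest `keep` version dirs."""
    versions = version_dirs(feed_dir)
    stale = versions[:max(0, len(versions) - keep)]
    for name in stale:
        shutil.rmtree(os.path.join(feed_dir, name))
    return stale


def run_demo():
    """Return (latest dir name, retired dir names, files read by consumer)."""
    feed_dir = tempfile.mkdtemp(prefix="feedA-")
    for version in ("t1", "t2"):
        os.makedirs(os.path.join(feed_dir, version))
        with open(os.path.join(feed_dir, version, "datafile"), "w") as f:
            f.write(version + "\n")

    latest = latest_version_dir(feed_dir)   # consumer picks only .../t2
    retired = retire_old_versions(feed_dir)  # retention drops .../t1
    read = sorted(os.listdir(latest))        # consumer input is untouched
    return os.path.basename(latest), retired, read


if __name__ == "__main__":
    print(run_demo())
```

Because the consumer never lists the older subdir, retiring it neither 
duplicates data nor removes a file the consumer selected as input.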

In this regard, I do not believe this validation restricts any use 
cases. In fact, I think it helps users avoid pitfalls. 




 

> Enforce time partition pattern in the data location path in feed definition 
> ----------------------------------------------------------------------------
>
>                 Key: FALCON-2030
>                 URL: https://issues.apache.org/jira/browse/FALCON-2030
>             Project: Falcon
>          Issue Type: Improvement
>          Components: feed
>            Reporter: Venkatesan Ramachandran
>            Assignee: Venkatesan Ramachandran
>
> In feed definition, data location can be specified without time series 
> pattern like below:
>    <locations>
>         <location type="data" 
> path="/tmp/falcon-regression/RetentionTest/testFolders/"/>
>         <location type="stats" path="/projects/falcon/clicksStats"/>
>         <location type="meta" path="/projects/falcon/clicksMetaData"/>
>     </locations>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
