[ https://issues.apache.org/jira/browse/FALCON-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069529#comment-15069529 ]

Mass Dosage commented on FALCON-1686:
-------------------------------------

This is indeed the backfill issue that [~sriksun] mentioned. Our use case is 
that we have a job which produces a data set with daily partitions, and we have 
data going back 3 years. A downstream process takes these daily partitions as 
input and does further processing to produce another daily data set. The 
downstream process is usually set up with the upstream feed as its input and 
start="today(0,0)" end="today(0,0)" as the instance range, so each run picks up 
the new daily partition in the upstream data set. Sometimes we change the code 
of the job that produces the upstream data set; once that change goes live, new 
partitions are produced with the new code. We would then typically like to run 
the new code, in the background, over *all* the past partitions - i.e. 
reprocess them so that the old data is regenerated with the new code. In some 
cases the reprocessed data should also trigger an update of the downstream data 
set, but in other cases it should not, which is why I said it would be nice to 
be able to control this. We could probably live with the upstream process not 
triggering a downstream reprocess, as long as it was easy to kick off the 
downstream reprocess ourselves once the upstream reprocess finished.
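
For context, the input of that downstream process is declared in its Falcon 
process definition roughly like the following (the feed and input names here 
are made up for illustration):

    <inputs>
        <!-- each run reads today's partition of the upstream feed -->
        <input name="upstreamInput" feed="upstream-daily-feed"
               start="today(0,0)" end="today(0,0)"/>
    </inputs>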

> Support for reprocessing
> ------------------------
>
>                 Key: FALCON-1686
>                 URL: https://issues.apache.org/jira/browse/FALCON-1686
>             Project: Falcon
>          Issue Type: Improvement
>    Affects Versions: 0.7
>            Reporter: Mass Dosage
>
> We have a number of ETL jobs which we schedule to run on a regular basis with 
> Falcon. This works fine. However, we often have cases where we need to run 
> the exact same jobs over past date ranges in order to reprocess data after a 
> code change. There doesn't seem to be any easy way to do this in Falcon at 
> the moment. Ideally we'd have a controlled way of saying "run this process 
> for dates between X and Y". There should also be a way to control whether 
> downstream processes are triggered by the data being reprocessed or not. In 
> some cases you may want downstream jobs to also run on the new data but in 
> other cases you might not. 
> With Oozie, if one wants to reprocess data from any time in history, one can 
> update the start and end dates (via the job.properties file) and submit a new 
> coordinator to run alongside the existing one. As coordinator IDs are unique, 
> the two do not clash. In Falcon, processes are identified by their 
> human-readable name, so one would need to change that in the process 
> definition file directly. 
> We are currently working around this issue by making a copy of the original 
> Falcon process, giving it a different name and changing the dates. This isn't 
> ideal and leads to a lot of XML duplication. 
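
To illustrate the Oozie route mentioned above: assuming the coordinator takes 
its start and end dates from job.properties, a backfill is just a second 
submission with different values (property names and dates below are 
illustrative):

    # job.properties for the backfill coordinator
    startTime=2013-01-01T00:00Z
    endTime=2015-12-31T00:00Z

    $ oozie job -config job.properties -run

The Falcon workaround, by contrast, means duplicating the entire process 
entity with a new name and a validity window covering the dates to reprocess, 
along the lines of (entity names and dates are illustrative):

    <process name="my-process-backfill" xmlns="uri:falcon:process:0.1">
        <clusters>
            <cluster name="my-cluster">
                <!-- only the process name and this validity range differ -->
                <validity start="2013-01-01T00:00Z" end="2015-12-31T00:00Z"/>
            </cluster>
        </clusters>
        <!-- everything else is copied verbatim from the original process -->
        ...
    </process>

Everything apart from the name and the validity dates is duplicated, which is 
the XML duplication referred to above.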



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
