potiuk commented on issue #18816:
URL: https://github.com/apache/airflow/issues/18816#issuecomment-938336552


   I think that without a larger redesign, a backfill API is not very useful 
- and I'd even argue the current API already has everything (or most of) what 
you need to do the backfill (but here I might be mistaken).
   
   I imagine two ways of doing backfill (and by backfill I understand cleaning 
and re-running a series of historical dag runs, possibly for only a subset of 
tasks: certain tasks and all tasks that depend on them).
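
   To make the "certain tasks and all tasks that depend on them" part concrete, 
here is a minimal sketch of how such a subset could be computed. It is plain 
Python and does not use any Airflow API; the `downstream` mapping and the 
example task names are made up for illustration.

```python
from collections import deque


def tasks_to_rerun(downstream, selected):
    """Return the selected tasks plus every task that transitively depends on them.

    `downstream` maps a task id to the ids of the tasks that directly depend on it.
    """
    seen = set(selected)
    queue = deque(selected)
    while queue:
        task = queue.popleft()
        for child in downstream.get(task, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical DAG: extract -> transform -> load, and transform -> report.
example_dag = {
    "extract": ["transform"],
    "transform": ["load", "report"],
    "load": [],
    "report": [],
}
```

Selecting `transform` here would re-run `transform`, `load` and `report`, but 
not `extract`.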
   
   My view is that you can do it in two ways (but this would need to be 
brought to the devlist if we would like to move it forward either way - as this 
is only my opinion and I might be mistaken, maybe there are other, simpler 
ways):
   
   1) "active" - basically replicating the way the current `airflow backfill` 
does it. You have a "user controlled" entity that monitors and controls the 
backfill. In `airflow backfill` it is a process started in the terminal that 
loops through all the historical dag runs, cleans and re-runs them. This 
requires an uninterrupted connection to the Airflow DB from the terminal, 
monitoring and reporting the status of the jobs, and active "scheduling" of 
tasks as if you ran them manually. I'd argue you can do it today with the 
current API or with small additions to it (to be verified); the only missing 
piece is to add "another client" that will do it rather than the 
`airflow backfill` process (and use the API to do what `airflow backfill` does 
by direct DB access and by running pieces of Airflow scheduling/dagrun code in 
the process). That is doable, it does not change the "model" of backfill, and 
it allows using the API rather than requiring the `airflow backfill` process to 
be run somewhere where the Airflow DB is directly accessible. This might be 
doable without a major design/AIP/changing the scheduler behaviour etc., I 
think.
   
   However, I'd also argue that the usefulness of that is limited, because you 
still need an active client the same way you need one now. The only benefit is 
that you do not need the "airflow" package installed in the client and you do 
not need direct DB access. And if you do it only for backfill, it would be at 
most a tactical solution.
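
   As a rough illustration of what such an "active" API client could look 
like, here is a minimal sketch, not a tested implementation. It assumes the 
stable REST API endpoints `POST /api/v1/dags/{dag_id}/clearTaskInstances` and 
`GET /api/v1/dags/{dag_id}/dagRuns`, with the field names as I remember them 
from the API reference, so please verify them against your Airflow version. 
The helper names (`clear_payload`, `make_http_call`, `backfill`) are 
hypothetical, authentication is left to whatever headers you pass in, and 
pagination of the dag run list is ignored.

```python
import json
import time
from urllib import parse, request

TERMINAL_STATES = {"success", "failed"}


def clear_payload(start_date, end_date):
    """Request body for clearing task instances in a date range (ISO 8601 strings)."""
    return {
        "dry_run": False,
        "start_date": start_date,
        "end_date": end_date,
        "only_failed": False,
        "reset_dag_runs": True,
    }


def make_http_call(base_url, headers):
    """Return call(method, path, body=None) that exchanges JSON with base_url."""
    def call(method, path, body=None):
        data = None if body is None else json.dumps(body).encode()
        req = request.Request(
            base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json", **headers},
        )
        with request.urlopen(req) as resp:
            return json.load(resp)
    return call


def backfill(call, dag_id, start_date, end_date, poll_seconds=30, sleep=time.sleep):
    """Clear the runs in the range, then poll until every run is in a terminal state."""
    call("POST", f"/api/v1/dags/{dag_id}/clearTaskInstances",
         clear_payload(start_date, end_date))
    query = parse.urlencode({"execution_date_gte": start_date,
                             "execution_date_lte": end_date})
    while True:
        runs = call("GET", f"/api/v1/dags/{dag_id}/dagRuns?{query}")["dag_runs"]
        states = {run["dag_run_id"]: run["state"] for run in runs}
        if all(state in TERMINAL_STATES for state in states.values()):
            return states
        sleep(poll_seconds)
```

Note that this only clears and re-runs dag runs that already exist; creating 
missing runs would need extra calls. The client also has to stay alive for the 
whole backfill, which is exactly the limitation described above.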
   
   I'd say it would be much better (more future-proof) to instead extend the 
`airflow cli` to be able to do everything the current CLI does via the API, and 
make a separate `airflow-cli` package that you could install independently from 
Airflow. That is something that partially worked in 1.10 (but it was rarely 
used and brittle) - the CLI could then use the experimental API for some 
operations and perform a small set of actions without DB access. It could be 
done incrementally, starting from backfill, but I think it's worth doing with 
the "Remote airflow CLI" as the goal, not just backfill - then it makes sense, 
I think, and might be a very good "strategic" direction.
   
   2) "passive" - you submit "BackfillJob"s via the API (and there are API 
calls that can check the progress). Then, in order to perform the backfill, you 
must have a component (either a modified scheduler or a separate component) 
that continuously runs, executes and monitors the backfills, and you also need 
a UI in the webserver to monitor and possibly re-run the Backfill Jobs. This 
is a much bigger effort that requires architectural changes in how the 
scheduler operates, or - more likely - implementing another scheduler-like 
component that would manage and control such backfills. I believe (@ashb?) the 
current scheduler is so heavily optimized that it would be difficult to make it 
run and control such Backfill Jobs, so having a separate component might make 
more sense.
   
   We'd need a DB modification to keep the status of the backfills and a UI to 
view and monitor them. This is the "ultimate" backfill solution that might make 
backfill a first-class citizen. But the effort required here is much bigger, 
and it has some connected components that will need to be updated (the Helm 
Chart for one, documentation on how to run and install Airflow, the Docker 
Compose quick start etc.) - a similar set of changes to those that were 
required when we added the "triggerer" for Deferrable Tasks for the upcoming 
2.2. But again - if we would like to discuss how to approach it, some proposal 
will have to be brought to the devlist so that others have a chance to take 
part in the discussion. Improving Backfill is one of those "important" but not 
"urgent" things, and any change in the approach, or changing the CLI to be able 
to use the API, needs to be raised there.
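
   To illustrate the "passive" model, here is a purely hypothetical sketch of 
the moving parts: a stored job record, a `submit` call, a `status` call, and a 
loop step run by a separate component. Nothing like this exists in Airflow 
today; all names (`BackfillJob`, `BackfillManager`, `run_one`) are invented for 
the example, and a dict stands in for the DB table.

```python
from dataclasses import dataclass, field
from enum import Enum


class BackfillState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


@dataclass
class BackfillJob:
    job_id: int
    dag_id: str
    pending: list  # logical dates still to be re-run
    completed: list = field(default_factory=list)
    state: BackfillState = BackfillState.QUEUED


class BackfillManager:
    """Stand-in for the separate scheduler-like component."""

    def __init__(self, run_one):
        self._jobs = {}  # replaces the DB table that would hold backfill status
        self._run_one = run_one  # callable(dag_id, logical_date) running one dag run

    def submit(self, dag_id, logical_dates):
        """What an API 'submit backfill' call would do: store the job, return its id."""
        job = BackfillJob(len(self._jobs) + 1, dag_id, list(logical_dates))
        self._jobs[job.job_id] = job
        return job.job_id

    def status(self, job_id):
        """What an API 'check progress' call would return."""
        job = self._jobs[job_id]
        return {"state": job.state.value,
                "completed": len(job.completed),
                "pending": len(job.pending)}

    def tick(self):
        """One loop iteration: advance every unfinished job by one dag run."""
        for job in self._jobs.values():
            if job.state is BackfillState.DONE:
                continue
            if job.pending:
                job.state = BackfillState.RUNNING
                logical_date = job.pending.pop(0)
                self._run_one(job.dag_id, logical_date)
                job.completed.append(logical_date)
            if not job.pending:
                job.state = BackfillState.DONE
```

The point of the sketch is only that the client can disconnect right after 
`submit`: progress lives in the stored job, and the long-running component is 
what calls `tick` repeatedly.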


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


