potiuk commented on issue #18816: URL: https://github.com/apache/airflow/issues/18816#issuecomment-938336552
I think that without a larger redesign, a backfill API is not very useful - and I'd even argue that the current API already has everything (or most of what) you need to do a backfill (but here I might be mistaken). By backfill I understand cleaning and re-running a series of historical dag runs, possibly for only a subset of tasks: certain tasks and all tasks that depend on them. I can imagine two ways of doing it, but this would need to be brought to the devlist if we would like to move it forward either way, as this is only my opinion and I might be mistaken; maybe there are other, simpler ways.

1) "active" - basically replicating the way the current `airflow backfill` does it. You have a "user controlled" entity that monitors and controls the backfill. In `airflow backfill` it is a process started in the terminal that loops through all the historical dag runs, cleans and re-runs them. This requires an uninterrupted connection to the Airflow DB from the terminal, monitoring and reporting the status of the jobs, and active "scheduling" of tasks as if you ran them manually. I'd argue you can do it today with the current API or with small additions to it (to be verified). The only missing piece is "another client" that does it instead of the `airflow backfill` process, and uses the API to do what `airflow backfill` does through direct DB access and by running pieces of the Airflow scheduling/dagrun code in the process. That is doable, it does not change the "model" of backfill, and it allows using the API rather than requiring the `airflow backfill` process to be run somewhere where the Airflow DB is directly accessible. I think this might be doable without a major design/AIP/change of the scheduler behaviour etc. However, I'd also argue that the usefulness of that is limited, because you still need an active client the same way you need one now.
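To make the "active" variant more concrete, here is a minimal sketch of the planning half of such a client. It only builds the requests and sends nothing. It assumes the stable REST API exposes `POST /api/v1/dags/{dag_id}/clearTaskInstances` with the body fields shown (to be verified against your Airflow version); the helper names `logical_dates` and `plan_backfill` are hypothetical, and selecting "all tasks that depend on them" is not handled here.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Optional, Sequence


def logical_dates(start: datetime, end: datetime, interval: timedelta) -> Iterator[datetime]:
    """Yield one date per schedule interval in the closed range [start, end]."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = start
    while current <= end:
        yield current
        current += interval


def plan_backfill(
    dag_id: str,
    start: datetime,
    end: datetime,
    interval: timedelta,
    task_ids: Optional[Sequence[str]] = None,
) -> List[dict]:
    """Build (but do not send) one 'clear' request per historical dag run.

    Endpoint path and body field names are assumptions about the stable
    REST API; check them against the API reference before use.
    """
    plan = []
    for run_date in logical_dates(start, end, interval):
        body = {
            "dry_run": False,        # actually clear, do not just preview
            "start_date": run_date.isoformat(),
            "end_date": run_date.isoformat(),
            "only_failed": False,    # clear regardless of previous state
            "reset_dag_runs": True,  # let the scheduler pick the run up again
        }
        if task_ids:
            # Restrict to a subset of tasks; downstream tasks are NOT added here.
            body["task_ids"] = list(task_ids)
        plan.append(
            {
                "method": "POST",
                "path": f"/api/v1/dags/{dag_id}/clearTaskInstances",
                "json": body,
            }
        )
    return plan


if __name__ == "__main__":
    for call in plan_backfill(
        "example_dag",
        datetime(2021, 10, 1, tzinfo=timezone.utc),
        datetime(2021, 10, 3, tzinfo=timezone.utc),
        timedelta(days=1),
    ):
        print(call["method"], call["path"], call["json"]["start_date"])
```

A real client would then send each planned call with an HTTP library and poll the dag run status until every run finishes, which is exactly the "you still need an active client" limitation described above.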
The only benefit is that you do not need the "airflow" package installed in the client and you do not need direct DB access. And if you do it only for backfill, it would be at most a tactical solution. I'd say it would be much better (more future proof) to instead extend the `airflow cli` so that it can do everything the current CLI does via the API, and make a separate `airflow-cli` package that you could install independently from Airflow. That is something that partially worked in 1.10 (but it was rarely used and brittle) - the CLI could then use the experimental API for some operations and perform a small set of actions without DB access. It could be done incrementally, starting from backfill, but I think it's worth doing it with the "Remote airflow CLI" as a goal, not just backfill - then I think it makes sense and might be a very good "strategic" direction.

2) "passive" - you submit "BackfillJob"s via the API (and there are API calls that can check the progress). Then, in order to perform the backfill, you must have a component (it could be either a modified scheduler or a separate component) that continuously runs, executes and monitors the backfills, and you also need a UI in the webserver to monitor and possibly re-run the Backfill Jobs. This is a much bigger effort that requires architectural changes in the way the scheduler operates, or - more likely - implementing another scheduler-like component that would manage and control such backfills. I believe (@ashb?) the current scheduler is heavily optimized in a way that will make it difficult to have it run and control such Backfill Jobs, so having a separate component might make more sense. We'd need DB modifications to keep the status of the backfills and a UI to view and monitor them. This is the "ultimate" backfill solution that might make backfill a first-class citizen.
But the effort required here is much bigger, and it has some connected components that will need to be updated (the Helm Chart for one, documentation on how to run and install Airflow, the Docker Compose quick start, etc.) - a similar set of changes to those that were required when we added the "triggerer" for Deferrable Tasks for the upcoming 2.2.

But again - if we would like to discuss how to approach it, some proposal will have to be brought to the devlist so that others have a chance to take part in the discussion. Improving Backfill is one of those "important" but not "urgent" things, and any change in the approach, or changing the CLI to be able to use the API, needs to be raised there.
