One comment on all of this: I'm not a fan of a _requiring_ separate api server component to be run - for smaller deployments (of which there are a lot!) this increases the memory requirements by a noticeable fraction.
-ash On 19 January 2020 19:58:34 GMT+03:00, Jarek Potiuk <jarek.pot...@polidea.com> wrote: >Some comments from my side: > >Bolke: > >Direct database access for users is a no go. Direct queries that in >anyway >> manipulate the database or any state in Airfl ow (trigger dags, >connections >> etc) from the CLI should be prevented. > > >Absolutely agree in principle. But we have to discuss and accept >profound >consequences >of such change. For me - this also means that those commands cannot be >run >on the machine where airflow is installed and CLI users should have no >access to those >machines - otherwise they can run arbitrary python method/class script >using airflow and >access the DB directly (it also means that there direct DB connection >should not be allowed) >This pretty much calls for either using scheduler as API server or >introduce a new >"API-server" entity. > >This is quite a departure from the current Airflow deployment model and >I >think it should be >"very breaking change" (architecturally/deployment wise). It could >still >look like a local client for >the user - so it's not a breaking change from user point of view. > >This also means that we do not need maybe 95% of airflow code and that >we >should either >not use "airflow" package (airflow-cli maybe?), or do something with >the >"__init__.py" of >the 'airflow' package. Currently we have a very bad import model if we >want >to use "airflow" package >for everything. By doing 'import airflow.<whatever>" we are pulling in >almost all airflow code >because it is either directly or indirectly imported in the >airflow/__init__.py. Including DB >models. If we stick to having this __init__.py it means that we will >continue to have many >seconds of delays for starting up the CLI. I think it is very bad to >have >seconds delays. > >For me it really calls even for another package ("airflow-cli") that >should >be installable >separately, Surely for local installation/development etc. both could >be >installed alongside and look >and behave as the current airflow does - and use localhost for >communication (and even we >should have the same backwards-compatible "airflow" binary). We could >gradually migrate all >the current CLI methods to switch to the new APIs. > >I fundamentally disagree with your position that remote working with >> Airflow should be done through the Web UI. > > >I agree. As I see it - all the modern tools for cloud world have all >three >modes for management - >Web UI (mostly used for reading and discovery), CLI - mostly used for >management and sometimes >for querying, and REST-ish APIS (for whatever custom integration is >needed). Look at docker, gcloud >and many others. From our survey - it's clear that this is also similar >in >our case. UI is used for monitoring >accessing logs etc more frequently than for management, and CLI - while >used less frequently in general >has a strong usage as well. > >On your points: >> >> 1-3: Solved by using FAB >> > >Indeed those three points might be solved by using FAB REST API. I am >not >100% sure if FAB is the best >choice (although I think it is a clear contender). What I really like >about >it that we seem to have a way to >use the same mechanism for authorisation/authentication as in UI. Using >a >standard approach here seems >like a good idea, especially that out-of-the-box we can use all the >"standard" authentication >mechanisms: >https://flask-appbuilder.readthedocs.io/en/latest/security.html# . >(DB, OpenAI, LDAP, Oauth, custom) - >they are all available out-of-the-box. What I also like is that FAB >REST >API has everything you expect from >a modern API ( >https://flask-appbuilder.readthedocs.io/en/latest/rest_api.html#openapi-spec >) >It implements "open-api" standard, has built-in swagger integration and >the >capability of not only generating the >documentation, having a swagger test server for API testing but also >capability of automatically generating >clients in different languages (I'd love however to see POC showing all >that capabilities). >IMHO any proposal for API implementation will have to have all those >properties and >FAB is hard to beat on that especially that we already use it.. > > >> 4. You main points are solved by using FAB. On B) I don’t think it is >a >> good idea to have an extra layer between json_client and >local_client. Both >> are pretty thin wrappers around `airflow.api.common.*`. >> > >I think we should get rid of the local client asap and introduce only >REST >API and API server. It's inevitable >IMHO and the sooner we introduce it - the better. This way we could >immediately compress the layers there >to 1 layer. > > >> 5. Non existent ;-) >> >Off-by-one indeed :) > > >> 6. I do agree that there is some code duplication going on, but its >> intention was there in order to have it in place until `local_client` >> wasn’t required anymore and the Rest API is mature. So mature the >API, get >> rid of local_client. >> > >I think we should do it ASAP and get rid of the local client (but >provide >equivalent binary CLI to do >the same that current CLI does). Once we remove the local >implementation we >should switch all the >CLI methods to use the API and eventually decouple it from main airflow >code once it is done. > >7. Having the CLI manipulate `airflow.cfg` is fine as long as it checks >if >> it has the right permissions to do so, the OS will enforce that >anyway. >> >Agree. > > >> 8. They are not in the same namespace. `airflow.api.client.*` vs >> `airflow.www.api`. If you mean by ‘package’ ‘Airflow’, then I suggest >maybe >> packaging the API/Client separately of break up the Airflow package >in >> “airflow-common”, “airflow-client”, “airflow-server”. > >Same thought exactly ^^^ > > >> 9. That’s not unexpected: the interface is documented whilst not >perfectly. >> The CLI *should* use the public API. As in eat you own dog food. The >(json) >> client api resides in its own namespace which does *not* import >Airflow at >> all. So interfacing clients should not use the local_client. >> > >Unfortunately this is not true - due to the way our __init_.py in >airflow >package is defined, >pretty much everything in airflow has an implicit dependency on pretty >much >all the code. See above. >Even if there are no "explicit" imports, just startup time is something >that we cannot do anything about. > >In my opinion you are mixing several issues. The CLI has its issues, >the >> API has its issues and maybe our packaging has its issues. We should >> address them separately. >> > >I think Kamil is right here in mixing the issues. They are already >mixed >unfortunately. We need to decide now >what is the target we want to achieve (separate api server? Separate >client? Separate binaries? Separate python packages? >Separate pypi packages?). Even if we do not implement it all and go >through >a transition period, we need to know where we >are heading. Unfortunately we have this nasty airflow.__init__.py >problem >that I think severely hampers any decoupling effort >we want to do - client/server and we need to decide first what we want >to >do and how to decouple (in the future) and define >our roadmap how to get there and where to break backwards >compatibility. > > > >Kamil: > > >> I don't like the current remote mode in CLI for several reasons: >> 1) The default API security configuration makes all API data public >> 2) We do not provide sufficient security mechanisms to secure the >API. >> 3) The current API does not allow you to limit permissions so that >the >> user can only use the selected feature. If the user has access to the >> API, he can do anything. >> > >I think we should not be put-off by current API implementation, It was >experimental after all. I think we have three options >- implement completely new architecture/deployment approach >- promote the current API to be "official" by adding the above >- implement some kind of mixture with using the same technology - >FAB but more of the features (open-api) and changing architecture to be >more of a client-server separation (my preference). > > >> 4) An abstraction leak between json_client and local_client. >> a) The API Clients have no defined exceptions. >> b) The API Clients have no defined return types >> > >I think this can all be handled by adding it. And FAB's OpenApi + >swagger >specs is one of the more common/standard way of defining those, where >we >have documentation, generating clients in multiple languages etc. > > >> 6) This forces duplicate code. This is dangerous because one method >> may not be sufficiently tested and thus behave unexpectedly. >> > >I agree we should not have duplicated code. We should remove local >client. > >7) Not everything should be possible via the API. e.g. Tomek Urbaszek >> is now working to add new commands that allow editing of the >> airflow.cfg file. >> PR: https://github.com/apache/airflow/pull/7130 > > >I think this depends on the general architecture we come up with. If we >decide >to come up with clear client/server split, the API for editing >configuration >will be the ONLY way to change it and it will be only available >remotely. >I think - similarly to Bolke - that in the "production" deployments, >even the admins should not have access to the DB of Airflow, nor to >configuration >of Airflow. It should all be protected via API access mechanisms. > > >> 8) We do not have sufficient separation of the API client and API >> server. The API client and API server are in the same python package. >> This seriously threatens the interoperability of this API. It's now >> very easy to build an API that will return data in an unexpected >> format. >> > >Agree - what we have now is bad and can be easily mixed and abused. >I think we should agree on the level of separation (python package/ >pypi package/ client-server deployment) we want to have and base all >our >further decisions on that. > >9) This means that people who want to use the API will only use the >> specified client API. The same API client that is used by the CLI, >> which is very unexpected because it has severe limitations. >> > >Indeed. But I think we should remove those imitations. > > >> I suggest that we do not develop remote mode in current CLI, and even >> delete it in fear future. > > >I think quite the opposite - we should turn all our interactions into >remote ones. >It might be that you want to achieve the same eventually. Maybe we talk >about the same >but look at it from a different perspective and different sequence of >reaching the target. >We need to talk a bit more about it to really understand whether we are >not >talking about >the same long term goal - and then agree on what will be the technology >to >get us there >and agree the way how to get there. > > >> a) Build an API that meets the basic requirements: > > >Yes. But I think it can be achieved by improving and extending features >of >current API. I think >it is a matter of technology choice and what steps we are going to >take. > > >> b) Build an API client. It is best if the client is generated >> automatically. >> >Yep. That's what FAB + OpenAPI + Swagger is very likely to deliver. We >do >not need to do it >ourselves - this is a mostly solved problem. > > >> c) We should build a CLI that provides access to the API in a secure >> manner. >> >Agree. And again - this is what FAB + Open API seems to deliver pretty >much >out-of-the box. > > >> Of course, we can try to build temporary solutions if the users need >> it, but this should be very limited and should have a warning about >> the level of stability and security. So if the user needs it can add >> commands that support remote mode, but the user should be aware that >> this is not a stable and secure solution. >> > >I think it's already there -> "experimental" is a clear indication that >this will likely >change in the future. > > >> I have plans for Q1 to build a stable API that does not have the >above >> issues. > > >I think we need that fairly fast - for Airflow 2.0. And I am pretty >convinced that if we use some existing >framework we can implement it rather fast - providing that we will >"stand >on the shoulders of giants" >and use what's already there. APIs in python programs are already >solved >problem. We do not need >to build our own solution for that - just decide on architecture/split, >plugin a framework and >implement endpoints. > > >J. > >-- > >Jarek Potiuk >Polidea <https://www.polidea.com/> | Principal Software Engineer > >M: +48 660 796 129 <+48660796129> >[image: Polidea] <https://www.polidea.com/>