Hey All, Very detailed discussion, Andrey and others. I am hearing everyone out here and I kinda liked the idea from Jarek about prefixing with "airflowconnections+".
I share everyone's problem here because I have been trying to get Hive HA connection type work which is of the format: jdbc:hive2://host1:port1,host2:port2,host3:port3/default;principal=hive/host@PRINCIPAL;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2 iand our connection parser is not processing it the way it should. Instead of deprecating it, I would like to go all in into *improving* it to make it more robust. Reading along the line, I can see a few great suggestions. I think the concept of adding a more flexible schema to our connections model would be a great idea. I understand that it comes with a lot of breaking changes and probably cannot make it into 2.x Airflow, but this might be the way to go. Also, another thought is to add a "protocol" field to the connections model so that somewhat more complex uri's such as some postgrest connections etc using JDBC can be constructed on the fly in our backend. I think this could solve some of the issues we face like: https://github.com/apache/airflow/issues/33442. These are a few of my thoughts, glad to finally see someone bring this up! Thanks & Regards, Amogh Desai On Sun, Nov 19, 2023 at 7:22 AM Andrey Anshin <andrey.ans...@taragol.is> wrote: > Hey guys! Thanks for all your feedback and suggestions. > > I see here couple of examples with postgresql, however I would like to > remind that postgresql it is not a valid connection type which doesn't > register into the providers manager, and we have an exclusive hack to turn > this connection type into the valid Airflow Postgres Connection type, see: > > https://github.com/apache/airflow/blob/71b512d314e14b2e0904f2d3a8ca3b00aa21fdc7/airflow/models/connection.py#L185-L187 > As far as I remember all attempts to extend this hack by additional types > were rejected. > > I also do not think that the idea of extending connection type by plus sign > would succeed. It just increase complexibility of URI parser, required to > support it until Airflow 3 and we also need to cover some cases like > airflow+postgresql+pg8000://scott:tiger@localhost/test, just because we > can't handle this cases. > What makes me think this? Well our past experience shows that > extending/fixing URI parsing breaks something else. The Issue #33442 shows > an attempt to deal with `:` without encoding strings; it is required to add > 100 lines of code just to avoid limitations of URI and our parser. > > If we would like to compare with SQLAlchemy it supports both URI and > keyword arguments during creation Engines, and for some engines or > parameters (e.g. pools) it is strictly required to set them through keyword > arguments only. That is in contrast to Airflow Connections where all > parameters are sent by a single blob of the data: retrive from DB, deser > URI, deser JSON. If we would like to draw an analogy with browser > connection, then SA URI is a POST request and Airflow Connection is a GET > request. > > I still think different people have different expectations of the Airflow > URI format. My expectation is that this is only a SerDe mechanism around > Airflow Connection. If we have a look at the codebase of Airflow we found > that this is not always true and we try to use it as SA URI. > > If someone would like to extend a brainstorm I would like to suggest > reworking our Connections and making ConnectionsV2, in the future we will > definitely find that this approach is wrong and create ConnectionsV3. If > not interesting, you could stop at that point. > > What is the connection right now? It is just set of conn_id, conn_type, > description, host, login, password, schema, port and plus extra for avoid > limitations of other parameters > > Are there any connections using these fields? No. > Are some connections expected to have a token/secret instead of password? > Yes > Are some connections expecting URI instead of host? Yes > Is it confusing users? Yep, regularly in discussions/issues/slack > Are connection types really matter? Only for DBApiHooks > Are connections in the UI consistent with these fields? No > Do we have a good mechanism for auto documentation? No > Do we have a mechanism for validation parameters? Only in the UI and > limited by extra fields. > Do connections provide different auth methods which are mutually exclusive? > Oh yeeeeah! > Is there any way to slightly change the connection from the user side? No, > you should create your own Hook which uses its own Operators. > > Let's introduce some old suggestions (only highlights) > 1. Instead of predefined fields make a flexible schema, e.g. replace all > existing fields except conn_id, description to the JSON object into the DB. > 2. Flexible schema doesn't mean that it accept everything, so we need to > validate it, e.g. JSONSchema > 3. Decouple ConnectionsV2 from the Hooks, and replace "conn_type" by > "implements". What does it mean? Hooks do not directly obtain raw > connection, instead of that Airflow retrieve connection, validate with a > schema, is it valid "implements" and create some pydantic/dataclass object > and provide it into the Hook. What does this object also contain? Basically > creation of the client or some similar. > 4. Multiple different ConnectionsV2 might implement the same "implements" > and be registered into the provider manager. > 5. Auto Generate documentation from the pydantic objects (or similar) > 6. Use schema for generate fields in the UI and validate inputs > 7. Since we have multiple implementations, each new connection with the > same "implements" would have multiple options, which could be selected by > drop down list. I think it is more handy than extra fields with mutually > exclusive fields. > 8. Maybe even subtypes. > > Probably if you still read this you might have a question: "Ok, but how is > it related to Airflow URI?" > Let return to expectations, user want to use " > https://example.org/foo/bar/spam/egg", let's give this opportunity to > create own ConnectionV2 (and maybe contribute it back) which accept simple > json > > {"uri": "some-awesome-uri"} > > And that all, want to provide your URI which is not related to Airflow URI? > Implement new Connection, you do not need to rewrite all Hooks/Operators. > Want to implement something insecure? Well implement it, deploy into the > Airflow cluster. > > And disadvantages: > - A lot of rework > - A lot of backcompat code, e.g. Hooks should works with current Connection > as well as ConnectionV2 until miracle happen > - A lot of pain. > > What happens with Airflow Connection URI format in this case? It do not > required, hard to implement due to flexible schema and should work only > with current Connection but not with V2 > > That is just a raw idea and to be honest I think it is hardly possible to > implement this one in Airflow 2.x > > ---- > Best Wishes > *Andrey Anshin* > > > > On Sun, 19 Nov 2023 at 02:25, utkarsh sharma <utkarshar...@gmail.com> > wrote: > > > I do think the URI format for airflow connections also serves the purpose > > of conveying the intent, similar to how we URI in a browser to connect > to a > > website. We use Ariflow Connection URI to connect to one of the services > > supported by Airflow. Due to this reason, I lean toward Jarek's proposal > of > > adding "airflowconnection+" or just "airflow+" prefix to the schema, > which > > should also solve its misinterpretation as a common URI. > > Also, I share the concerns regarding the breaking changes. > > > > So I would like to vote -1. > > > > Thanks, > > Utkarsh Sharma > > > > > > On Sun, Nov 19, 2023 at 3:21 AM Bolke de Bruin <bdbr...@gmail.com> > wrote: > > > > > So -1 on deprecating. I think we should do the opposite and go all in. > > The > > > suggestion by Jarek makes sense in this case, as it remains and is > fully > > > standards compliant. The JSON representation isn't: it is our own > crafted > > > format. > > > > > > My 2 cents, > > > > > > Bolke > > > > > > On Sat, 18 Nov 2023 at 22:15, Jarek Potiuk <ja...@potiuk.com> wrote: > > > > > > > I agree with Ash that it's much easier to define a lot of simpler > > > > connections this way. But it is also very confusing the way now how > > you > > > > can mix the "real" url with "airflow connection" URL. > > > > And Daniel is very right about the magnitude of breaking change. > > > > > > > > But possibly there is a way to eat cake and have it too. > > > > > > > > Answering Andrey's thinking: 1) I propose to deprecate it (raise > > warning) > > > > and even remove it from documentation > > > > But following Daniel's "pain": 2) possibly we should not even > schedule > > > it > > > > for removal for Airflow 3 > > > > And answering Ash's concern: 3) we should likely figure a > non-confusing > > > way > > > > how to define connection in "airlfow way". > > > > > > > > We could be following SQLlAlchemy approach - and define connection > with > > > > "airflowconnection+" prefix in the schema > > > > > > > > AIRFLOW_CONN_MY_DB=airflowconnection+postgresql://user:pass@host > > > > > > > > I think this would remove all the confusion, keep an easy way of > > defining > > > > connection but also avoid "heavily" breaking change. > > > > > > > > J. > > > > > > > > > > > > > > > > On Sat, Nov 18, 2023 at 5:41 PM Daniel Standish > > > > <daniel.stand...@astronomer.io.invalid> wrote: > > > > > > > > > The thing that makes *me* hesitant to deprecate is the sheer > > magnitude > > > of > > > > > breaking it would bring (even though we're only talking about a > > > > > hypothetical 3.0 release), balanced against the actual pain it > > causes. > > > > > I.e. it's confusing to use, and takes up space in docs (when, if > > > removed, > > > > > we could just show one good way of doing things), but, it doesn't > > > create > > > > > that much of a maintenance burden anymore, and has certainly become > > > more > > > > > reliable and easy to use over time (e.g. since we added get_uri). > > > > > > > > > > > > > > > On Sat, Nov 18, 2023 at 3:30 AM Ash Berlin-Taylor <a...@apache.org> > > > > wrote: > > > > > > > > > > > -1 — for the simple case the URI format is perfect: > > > > > > > > > > > > AIRFLOW_CONN_MY_DB=postgresql://user:pass@host > > > > > > > > > > > > We can push JSON format for more complex cases, and/or limit URI > to > > > > only > > > > > > the simple cases as long as we don’t remove it entirely. > > > > > > > > > > > > -ash > > > > > > > > > > > > > On 17 Nov 2023, at 22:44, Andrey Anshin < > > andrey.ans...@taragol.is> > > > > > > wrote: > > > > > > > > > > > > > > Greetings! > > > > > > > > > > > > > > I want to propose to deprecate Airflow Connection URI > > > representation > > > > > and > > > > > > > remove it in Airflow 3 in favor of the already existing > > > replacement - > > > > > > JSON > > > > > > > representation. > > > > > > > > > > > > > > In the past URI representation helped to add one of the awesome > > > > > features > > > > > > - > > > > > > > Alternative Secrets Backends: Environment Variables, Files or > > > > Provider > > > > > > > specific Backends. > > > > > > > > > > > > > > However I have a feeling that nowadays it have some sort of > > > > limitations > > > > > > due > > > > > > > to the parser, see: > > > > > > > - https://github.com/apache/airflow/issues/33442 > > > > > > > > > > > > > > Or miss interpretation that Airflow connection it is something > > more > > > > > than > > > > > > > just representation of Airflow Connection and it is a the > > > replacement > > > > > to > > > > > > > any other type of URIs such as: HTTP, Postgres URIs, SQLAlchemy > > > URI, > > > > > > Spark > > > > > > > URI and etc > > > > > > > > > > > > > > And attempt to fix it > > > > > > > - https://github.com/apache/airflow/issues/10913 > > > > > > > - https://github.com/apache/airflow/pull/27796 > > > > > > > - https://github.com/apache/airflow/pull/35712 > > > > > > > > > > > > > > And attempt to use it in DbApiHooks, for detail see: > > > > > > > - > > https://lists.apache.org/thread/8rhmz3qh30hvkondct4sfmgk4vd07mn5 > > > > > > > > > > > > > > In addition, complicated Connection in the URI format is not > > human > > > > > > > readable, requires additional decoding, and it is hard to > create > > it > > > > > > > manually. > > > > > > > > > > > > > > The only one beneficial part of URI representation is ability > to > > > > create > > > > > > it > > > > > > > programatically > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#generating-a-connection-uri > > > > > > > , there is no such of method for generate JSON, but I don't see > > the > > > > > > problem > > > > > > > to create such of method in Connection object which return > string > > > > > > > representation of JSON connection. > > > > > > > > > > > > > > So my suggestion would be pretty simple: > > > > > > > 1. Deprecate accept URI as parameter to the Connection object > > > > > > > 2. Deprecate uri property > > > > > > > 3. Replace get_uri by get_json > > > > > > > 4. Replace all URI examples in Providers Connections by JSON > > > > > alternatives > > > > > > > > > > > > > > > > > > > > > WDYT? Maybe I miss something and we should keep Airflow > > Connection > > > as > > > > > URI > > > > > > > in the future Airflow's major versions and support both ways to > > > > provide > > > > > > > connections as JSON and URI. > > > > > > > > > > > > > > ---- > > > > > > > Best Wishes > > > > > > > *Andrey Anshin* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > -- > > > Bolke de Bruin > > > bdbr...@gmail.com > > > > > >