Hey All,

Very detailed discussion, Andrey and others.
I am hearing everyone out here and I kinda liked the idea from Jarek about
prefixing with "airflowconnections+".

I share everyone's problem here because I have been trying to get Hive HA
connection type work which is of the format:
jdbc:hive2://host1:port1,host2:port2,host3:port3/default;principal=hive/host@PRINCIPAL;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2
iand our connection parser is not processing it the way it should.

Instead of deprecating it, I would like to go all in into *improving* it to
make it more robust. Reading along the line, I can see a few great
suggestions.

I think the concept of adding a more flexible schema to our connections
model would be a great idea. I understand that it comes with a lot of
breaking changes and probably cannot make it into 2.x Airflow, but this
might be the way to go.

Also, another thought is to add a "protocol" field to the connections model
so that somewhat more complex uri's such as some postgrest connections etc
using JDBC can be constructed on the fly in our backend. I think this could
solve some of the issues we face like:
https://github.com/apache/airflow/issues/33442.

These are a few of my thoughts, glad to finally see someone bring this up!

Thanks & Regards,
Amogh Desai

On Sun, Nov 19, 2023 at 7:22 AM Andrey Anshin <andrey.ans...@taragol.is>
wrote:

> Hey guys! Thanks for all your feedback and suggestions.
>
> I see here couple of examples with postgresql, however I would like to
> remind that postgresql it is not a valid connection type which doesn't
> register into the providers manager, and we have an exclusive hack to turn
> this connection type into the valid Airflow Postgres Connection type, see:
>
> https://github.com/apache/airflow/blob/71b512d314e14b2e0904f2d3a8ca3b00aa21fdc7/airflow/models/connection.py#L185-L187
> As far as I remember all attempts to extend this hack by additional types
> were rejected.
>
> I also do not think that the idea of extending connection type by plus sign
> would succeed. It just increase complexibility of URI parser, required to
> support it until Airflow 3 and we also need to cover some cases like
> airflow+postgresql+pg8000://scott:tiger@localhost/test, just because we
> can't handle this cases.
> What makes me think this? Well our past experience shows that
> extending/fixing URI parsing breaks something else. The Issue #33442 shows
> an attempt to deal with `:` without encoding strings; it is required to add
> 100 lines of code just to avoid limitations of URI and our parser.
>
> If we would like to compare with SQLAlchemy it supports both URI and
> keyword arguments during creation Engines, and for some engines or
> parameters (e.g. pools) it is strictly required to set them through keyword
> arguments only. That is in contrast to Airflow Connections where all
> parameters are sent by a single blob of the data: retrive from DB, deser
> URI, deser JSON. If we would like to draw an analogy with browser
> connection, then SA URI is a POST request and Airflow Connection is a GET
> request.
>
> I still think different people have different expectations of the Airflow
> URI format. My expectation is that this is only a SerDe mechanism around
> Airflow Connection. If we have a look at the codebase of Airflow we found
> that this is not always true and we try to use it as SA URI.
>
> If someone would like to extend a brainstorm I would like to suggest
> reworking our Connections and making ConnectionsV2, in the future we will
> definitely find that this approach is wrong and create ConnectionsV3. If
> not interesting, you could stop at that point.
>
> What is the connection right now? It is just set of conn_id, conn_type,
> description, host, login, password, schema, port and plus extra for avoid
> limitations of other parameters
>
> Are there any connections using these fields? No.
> Are some connections expected to have a token/secret instead of password?
> Yes
> Are some connections expecting URI instead of host? Yes
> Is it confusing users? Yep, regularly in discussions/issues/slack
> Are connection types really matter? Only for DBApiHooks
> Are connections in the UI consistent with these fields? No
> Do we have a good mechanism for auto documentation? No
> Do we have a mechanism for validation parameters? Only in the UI and
> limited by extra fields.
> Do connections provide different auth methods which are mutually exclusive?
> Oh yeeeeah!
> Is there any way to slightly change the connection from the user side? No,
> you should create your own Hook which uses its own Operators.
>
> Let's introduce some old suggestions (only highlights)
> 1. Instead of predefined fields make a flexible schema, e.g. replace all
> existing fields except conn_id, description to the JSON object into the DB.
> 2. Flexible schema doesn't mean that it accept everything, so we need to
> validate it, e.g. JSONSchema
> 3. Decouple ConnectionsV2 from the Hooks, and replace "conn_type" by
> "implements". What does it mean? Hooks do not directly obtain raw
> connection, instead of that Airflow retrieve connection, validate with a
> schema, is it valid "implements" and create some pydantic/dataclass object
> and provide it into the Hook. What does this object also contain? Basically
> creation of the client or some similar.
> 4. Multiple different ConnectionsV2 might implement the same "implements"
> and be registered into the provider manager.
> 5. Auto Generate documentation from the pydantic objects (or similar)
> 6. Use schema for generate fields in the UI and validate inputs
> 7. Since we have multiple implementations, each new connection with the
> same "implements" would have multiple options, which could be selected by
> drop down list. I think it is more handy than extra fields with mutually
> exclusive fields.
> 8. Maybe even subtypes.
>
> Probably if you still read this you might have a question: "Ok, but how is
> it related to Airflow URI?"
> Let return to expectations, user want to use "
> https://example.org/foo/bar/spam/egg";, let's give this opportunity to
> create own ConnectionV2 (and maybe contribute it back) which accept simple
> json
>
> {"uri": "some-awesome-uri"}
>
> And that all, want to provide your URI which is not related to Airflow URI?
> Implement new Connection, you do not need to rewrite all Hooks/Operators.
> Want to implement something insecure? Well implement it, deploy into the
> Airflow cluster.
>
> And disadvantages:
> - A lot of rework
> - A lot of backcompat code, e.g. Hooks should works with current Connection
> as well as ConnectionV2 until miracle happen
> - A lot of pain.
>
> What happens with Airflow Connection URI format in this case? It do not
> required, hard to implement due to flexible schema and should work only
> with current Connection but not with V2
>
> That is just a raw idea and to be honest I think it is hardly possible to
> implement this one in Airflow 2.x
>
> ----
> Best Wishes
> *Andrey Anshin*
>
>
>
> On Sun, 19 Nov 2023 at 02:25, utkarsh sharma <utkarshar...@gmail.com>
> wrote:
>
> > I do think the URI format for airflow connections also serves the purpose
> > of conveying the intent, similar to how we URI in a browser to connect
> to a
> > website. We use Ariflow Connection URI to connect to one of the services
> > supported by Airflow. Due to this reason, I lean toward Jarek's proposal
> of
> > adding "airflowconnection+" or just "airflow+" prefix to the schema,
> which
> > should also solve its misinterpretation as a common URI.
> > Also, I share the concerns regarding the breaking changes.
> >
> > So I would like to vote -1.
> >
> > Thanks,
> > Utkarsh Sharma
> >
> >
> > On Sun, Nov 19, 2023 at 3:21 AM Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >
> > > So -1 on deprecating. I think we should do the opposite and go all in.
> > The
> > > suggestion by Jarek makes sense in this case, as it remains and is
> fully
> > > standards compliant. The JSON representation isn't: it is our own
> crafted
> > > format.
> > >
> > > My 2 cents,
> > >
> > > Bolke
> > >
> > > On Sat, 18 Nov 2023 at 22:15, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > > I agree with  Ash that it's much easier to define a lot of simpler
> > > > connections this way. But it is also very confusing the way now  how
> > you
> > > > can mix the "real" url with "airflow connection" URL.
> > > > And Daniel is very right about the magnitude of breaking change.
> > > >
> > > > But possibly there is a way to eat cake and have it too.
> > > >
> > > > Answering Andrey's thinking: 1) I propose to deprecate it (raise
> > warning)
> > > > and even remove it from documentation
> > > > But following Daniel's "pain": 2)  possibly we should not even
> schedule
> > > it
> > > > for removal for Airflow 3
> > > > And answering Ash's concern: 3) we should likely figure a
> non-confusing
> > > way
> > > > how to define connection in "airlfow way".
> > > >
> > > > We could be following SQLlAlchemy approach - and define connection
> with
> > > > "airflowconnection+" prefix in the schema
> > > >
> > > > AIRFLOW_CONN_MY_DB=airflowconnection+postgresql://user:pass@host
> > > >
> > > > I think this would remove all the confusion, keep an easy way of
> > defining
> > > > connection but also avoid "heavily" breaking change.
> > > >
> > > > J.
> > > >
> > > >
> > > >
> > > > On Sat, Nov 18, 2023 at 5:41 PM Daniel Standish
> > > > <daniel.stand...@astronomer.io.invalid> wrote:
> > > >
> > > > > The thing that makes *me* hesitant to deprecate is the sheer
> > magnitude
> > > of
> > > > > breaking it would bring (even though we're only talking about a
> > > > > hypothetical 3.0 release), balanced against the actual pain it
> > causes.
> > > > > I.e. it's confusing to use, and takes up space in docs (when, if
> > > removed,
> > > > > we could just show one good way of doing things), but, it doesn't
> > > create
> > > > > that much of a maintenance burden anymore, and has certainly become
> > > more
> > > > > reliable and easy to use over time (e.g. since we added get_uri).
> > > > >
> > > > >
> > > > > On Sat, Nov 18, 2023 at 3:30 AM Ash Berlin-Taylor <a...@apache.org>
> > > > wrote:
> > > > >
> > > > > > -1 — for the simple case the URI format is perfect:
> > > > > >
> > > > > > AIRFLOW_CONN_MY_DB=postgresql://user:pass@host
> > > > > >
> > > > > > We can push JSON format for more complex cases, and/or limit URI
> to
> > > > only
> > > > > > the simple cases as long as we don’t remove it entirely.
> > > > > >
> > > > > > -ash
> > > > > >
> > > > > > > On 17 Nov 2023, at 22:44, Andrey Anshin <
> > andrey.ans...@taragol.is>
> > > > > > wrote:
> > > > > > >
> > > > > > > Greetings!
> > > > > > >
> > > > > > > I want to propose to deprecate Airflow Connection URI
> > > representation
> > > > > and
> > > > > > > remove it in Airflow 3 in favor of the already existing
> > > replacement -
> > > > > > JSON
> > > > > > > representation.
> > > > > > >
> > > > > > > In the past URI representation helped to add one of the awesome
> > > > > features
> > > > > > -
> > > > > > > Alternative Secrets Backends: Environment Variables, Files or
> > > > Provider
> > > > > > > specific Backends.
> > > > > > >
> > > > > > > However I have a feeling that nowadays it have some sort of
> > > > limitations
> > > > > > due
> > > > > > > to the parser, see:
> > > > > > > - https://github.com/apache/airflow/issues/33442
> > > > > > >
> > > > > > > Or miss interpretation that Airflow connection it is something
> > more
> > > > > than
> > > > > > > just representation of Airflow Connection and it is a the
> > > replacement
> > > > > to
> > > > > > > any other type of URIs such as: HTTP, Postgres URIs, SQLAlchemy
> > > URI,
> > > > > > Spark
> > > > > > > URI and etc
> > > > > > >
> > > > > > > And attempt to fix it
> > > > > > > - https://github.com/apache/airflow/issues/10913
> > > > > > > - https://github.com/apache/airflow/pull/27796
> > > > > > > - https://github.com/apache/airflow/pull/35712
> > > > > > >
> > > > > > > And attempt to use it in DbApiHooks, for detail see:
> > > > > > > -
> > https://lists.apache.org/thread/8rhmz3qh30hvkondct4sfmgk4vd07mn5
> > > > > > >
> > > > > > > In addition, complicated Connection in the URI format is not
> > human
> > > > > > > readable, requires additional decoding, and it is hard to
> create
> > it
> > > > > > > manually.
> > > > > > >
> > > > > > > The only one beneficial part of URI representation is ability
> to
> > > > create
> > > > > > it
> > > > > > > programatically
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#generating-a-connection-uri
> > > > > > > , there is no such of method for generate JSON, but I don't see
> > the
> > > > > > problem
> > > > > > > to create such of method in Connection object which return
> string
> > > > > > > representation of JSON connection.
> > > > > > >
> > > > > > > So my suggestion would be pretty simple:
> > > > > > > 1. Deprecate accept URI as parameter to the Connection object
> > > > > > > 2. Deprecate uri property
> > > > > > > 3. Replace get_uri by get_json
> > > > > > > 4. Replace all URI examples in Providers Connections by JSON
> > > > > alternatives
> > > > > > >
> > > > > > >
> > > > > > > WDYT? Maybe I miss something and we should keep Airflow
> > Connection
> > > as
> > > > > URI
> > > > > > > in the future Airflow's major versions and support both ways to
> > > > provide
> > > > > > > connections as JSON and URI.
> > > > > > >
> > > > > > > ----
> > > > > > > Best Wishes
> > > > > > > *Andrey Anshin*
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > --
> > > Bolke de Bruin
> > > bdbr...@gmail.com
> > >
> >
>

Reply via email to