Re: [DISCUSSION] Add 5 new Providers to enable first-class LLMOps

Jarek Potiuk Wed, 18 Oct 2023 03:50:02 -0700

I thought a bit about it, and I think that with "Astronomer" behind it,
this checks all the boxes - provided that we will also have some (super
simple) dashboard similar to the MWAA one
https://aws-mwaa.github.io/open-source/system-tests/dashboard.html .

From
https://github.com/apache/airflow/blob/main/PROVIDERS.rst#rd-party-providers
:

> While we already have - historically - a number of 3rd-party service
providers managed by the community, most of those services have dedicated
teams that keep an eye on the community providers and not only take active
part in managing them (see mixed-governance model below), but also provide
a way that we can verify whether the provider works with the latest version
of the service via dashboards that show status of System Tests for the
provider. This allows us to have a high level of confidence that when we
release the provider it works with the latest version of the service.
System Tests are part of the Airflow code, but they are executed and
verified by those 3rd party service teams. We are working with the 3rd
party service teams (who are often important stakeholders of the Apache
Airflow project) to add dashboards for the historical providers that are
managed by the community, and current set of Dashboards can be also found
at the Ecosystem: system test dashboards

Whenever someone (including Weaviate in the past) asked if they could
contribute providers - we always referred to that chapter and said - "we
need to have a good reason" and "we need to have confidence the integration
will not be broken in the future".

So there are two conditions IMHO:

1) Having a good reason why we want it in
2) Having confidence that we can keep the integration "working" in the
future without a lot of overhead and without having to pay for the integration

Re 1) I think there is a very good reason why we want to have those in the
community - LLMs are all the rage and making LLMs a first-class citizen in
Airflow is a no-brainer, and Kaxil laid it out nicely in the email.
Re 2) I think this is a great opportunity for Astronomer to take the
"3rd-party maintenance" role and follow the "System Test dashboard" idea.

Of course Astronomer is going to be committed to it - no doubt about it :).
And I believe Astronomer already runs similar tests, using managed Airflow
instances to run Airflow test cases (and more or less complex DAGs)
regularly. As long as we have some basic example_dags/system_tests added
for those providers, and they are run regularly on Astronomer-managed
instances with accounts for Weaviate and the others configured, plus some
simple dashboard where we can see the status of those DAG runs, we should
be good to go.

Not everyone here is aware of it, but a number of issues have already been
fixed by the MWAA team simply because they were alerted by the regular
system tests, and they were able to fix those issues before they made their
way into new releases. I - for one - usually take a quick look at the
dashboard before a new provider's release and it gives quite a lot of
confidence that some "serious" issues are not overlooked. Seeing a whole
week of "all green" there is reassuring. This was quite an effort from the
MWAA team to implement and keep running, but I think the scope/complexity
of the LLM integration is much lower - and those example dags should be far
more stable and straightforward for Astronomer to run, because the LLM
cases are generally much simpler than the "infrastructure" cases of the
multiple services that the AWS integration requires.

It could even be a super simple dump of HTML to a public S3 bucket like
MWAA does (using Airflow to run it and the Airflow API to retrieve the
status, for example) + some alerting on the Astronomer side to detect (and
fix before release) any issues. That would be more than enough and would
check all the boxes for me.
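
To illustrate how small such a dashboard could be, here is a rough sketch
(not what MWAA or Astronomer actually run). It assumes the Airflow 2 stable
REST API with the basic-auth backend enabled; the base URL, credentials and
DAG ids are placeholders, and uploading the resulting HTML to a bucket is
left out.

```python
import html
import json
import urllib.parse
import urllib.request
from base64 import b64encode


def fetch_latest_state(base_url, dag_id, user, password):
    """Return the state of the most recent run of dag_id, or "no runs".

    Assumes the stable REST API (/api/v1) and basic auth are enabled.
    """
    url = (
        f"{base_url}/api/v1/dags/{urllib.parse.quote(dag_id)}"
        "/dagRuns?limit=1&order_by=-execution_date"
    )
    token = b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        runs = json.load(response)["dag_runs"]
    return runs[0]["state"] if runs else "no runs"


def render_dashboard(states):
    """Render a {dag_id: state} mapping as a minimal HTML table."""
    rows = "\n".join(
        f"<tr><td>{html.escape(dag_id)}</td><td>{html.escape(state)}</td></tr>"
        for dag_id, state in sorted(states.items())
    )
    header = "<tr><th>System test</th><th>Last run</th></tr>"
    return f"<table>\n{header}\n{rows}\n</table>"
```

A scheduled DAG could call fetch_latest_state() for each system test DAG,
pass the collected states to render_dashboard() and publish the resulting
string as a static page.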

J.

On Tue, Oct 17, 2023 at 8:42 PM Kaxil Naik <kaxiln...@apache.org> wrote:

> Hey Everyone,
>
> As a follow-up to my Keynote talk, Building and deploying LLM applications
> with Apache Airflow <https://www.youtube.com/watch?v=mgA6m3ggKhs&t=4s>, I
> am formally proposing the addition of these 5 providers to the Apache
> Airflow repo:
>
>    - PgVector <https://github.com/pgvector/pgvector>
>    - Weaviate <https://weaviate.io/>
>    - Pinecone <https://www.pinecone.io/>
>    - OpenAI <https://openai.com/>
>    - Cohere <https://cohere.com/>
>
>
> Advancements in LLMs are moving at a rapid pace & transforming the way we
> work and our industry. Although LLMs are simple to use in prototyping,
> using LLM for enterprise applications and for production still presents a
> lot of challenges. These
> <
> https://speakerdeck.com/kaxil/building-and-deploying-llm-applications-with-apache-airflow?slide=8
> >
> are some of the same problems that we tackle in Data Engineering, and
> Airflow is a natural fit for them.
>
> We at Astronomer would like to add first-class support for the popular LLMs
> (OpenAI & Cohere) and vector DBs (PgVector, Weaviate & Pinecone) so that
> Data Scientists and ML engineers can utilize them natively with easy-to-use
> Operator & Hook abstractions while providing a native (and
> Production-ready) approach for Authentication, retries, logging etc.
>
> We also think this is vital for the Apache Airflow project as we, the
> project, embrace the LLM tide and continue to be a great example of
> balancing innovation and maintaining backward-compatibility.
>
> The first versions of these providers will enable building one of the most
> common use cases of LLMs i.e. Question and Answering / Chatbots using
> Retrieval-augmented generation (RAG) done with the help of embeddings.
>
> Everyone is welcome and encouraged to contribute once the PRs are merged.
> Astronomer is committed to maintaining these providers in the Airflow repo,
> including reviewing PRs, maintaining code quality, testing and keeping the
> APIs up-to-date.
>
> Note: PgVector <https://github.com/pgvector/pgvector> is an open-source
> project, so we don’t need a formal vote for it as per our guidelines
> <
> https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers
> >.
> So please consider this email as seeking a Lazy Consensus for it.
>
> I will open up a VOTING thread after discussing this for a few days.
>
> Thanks.
>
> Regards,
>
> Kaxil
>
