Re: [DISCUSSION] Add 5 new Providers to enable first-class LLMOps

Kaxil Naik Thu, 19 Oct 2023 04:50:47 -0700

Note: Weaviate <https://github.com/weaviate/weaviate> is also an
open-source project with over 470k downloads last month
<https://pypistats.org/packages/weaviate-client>.


On Thu, 19 Oct 2023 at 12:40, Kaxil Naik <kaxiln...@gmail.com> wrote:

> Absolutely, we will publish the results of test runs somewhere. We would
> probably start by dumping them in a publicly accessible S3 bucket /
> GitHub issue and then move to a dashboard.
>
>> Re 2) I think this is a great opportunity for Astronomer to take the
>> "3rd-party maintenance" role to follow the "System Test dashboard" idea.
>
>
> Yup, we run a lot of integration/system tests from Airflow main too, and
> when they break, we fix them with PRs to the main branch.
>
>> It could even be a super simple dump of HTML to a public S3 bucket like
>> MWAA does (using Airflow to run it and the Airflow API to retrieve the
>> status, for example). That, plus some alerting on the Astronomer side to
>> detect (and fix before release) any issues, would be more than enough and
>> would check all the boxes for me.
>
>
> Regards,
> Kaxil
>
> On Wed, 18 Oct 2023 at 11:50, Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I thought a bit about it, and I think that with "Astronomer" behind it,
>> it checks all the boxes - provided that we will also have some (super
>> simple) dashboard similar to the MWAA one
>> https://aws-mwaa.github.io/open-source/system-tests/dashboard.html .
>>
>> From
>>
>> https://github.com/apache/airflow/blob/main/PROVIDERS.rst#rd-party-providers
>> :
>>
>> > While we already have - historically - a number of 3rd-party service
>> providers managed by the community, most of those services have dedicated
>> teams that keep an eye on the community providers and not only take active
>> part in managing them (see mixed-governance model below), but also provide
>> a way that we can verify whether the provider works with the latest
>> version
>> of the service via dashboards that show status of System Tests for the
>> provider. This allows us to have a high level of confidence that when we
>> release the provider it works with the latest version of the service.
>> System Tests are part of the Airflow code, but they are executed and
>> verified by those 3rd party service teams. We are working with the 3rd
>> party service teams (who are often important stakeholders of the Apache
>> Airflow project) to add dashboards for the historical providers that are
>> managed by the community, and current set of Dashboards can be also found
>> at the Ecosystem: system test dashboards
>>
>> Whenever someone (including Weaviate in the past) asked if they could
>> contribute providers, we always referred to that chapter and said "we
>> need to have a good reason" and "we need to have confidence the
>> integration is not broken in the future".
>>
>> So there are two conditions IMHO:
>>
>> 1) Having a good reason why we want it in
>> 2) Having confidence that we can keep the integration "working" in the
>> future without a lot of overhead or having to pay for the integration
>>
>> Re 1) I think there is a very good reason why we want to have those in the
>> community - LLMs are all the rage, and making LLMs a first-class citizen
>> in Airflow is a no-brainer. Kaxil laid it out nicely in the email.
>> Re 2) I think this is a great opportunity for Astronomer to take the
>> "3rd-party maintenance" role to follow the "System Test dashboard" idea.
>>
>> Of course Astronomer is going to be committed to it - no doubt about it :).
>> And I believe Astronomer already runs similar tests using Airflow managed
>> instances to run Airflow test cases (and more/less complex DAGs
>> regularly). As long as we have some basic example_dags/system_tests added
>> for those providers, and they are run regularly on Astronomer managed
>> instances with accounts for Weaviate and others configured, plus some
>> simple dashboard where we can see the status of those DAG runs, we should
>> be good to go.
>>
>> Not everyone here is aware of it, but a number of issues have already been
>> fixed by the MWAA team simply because they were alerted by the regular
>> system tests, and they were able to fix those issues before they made their
>> way into new releases. I - for one - usually take a quick look at the
>> dashboard before a new provider release, and it gives quite a lot of
>> confidence that no "serious" issues are overlooked. Seeing a whole week of
>> "all green" there is reassuring. This was quite an effort from the MWAA
>> team to implement and keep running, but I think the scope/complexity of
>> the LLM integrations is much lower, and those example DAGs should be far
>> more stable and straightforward for Astronomer to run, because the LLM
>> cases are generally much simpler than the "infrastructure" cases of the
>> multiple services that the AWS integration requires.
>>
>> It could even be a super simple dump of HTML to a public S3 bucket like
>> MWAA does (using Airflow to run it and the Airflow API to retrieve the
>> status, for example). That, plus some alerting on the Astronomer side to
>> detect (and fix before release) any issues, would be more than enough and
>> would check all the boxes for me.
>>
>>
>> J.
>>
>>
>> On Tue, Oct 17, 2023 at 8:42 PM Kaxil Naik <kaxiln...@apache.org> wrote:
>>
>> > Hey Everyone,
>> >
>> > As a follow-up to my Keynote talk, Building and deploying LLM applications
>> > with Apache Airflow <https://www.youtube.com/watch?v=mgA6m3ggKhs&t=4s>, I
>> > am formally proposing the addition of these 5 providers to the Apache
>> > Airflow repo:
>> >
>> >    - PgVector <https://github.com/pgvector/pgvector>
>> >    - Weaviate <https://weaviate.io/>
>> >    - Pinecone <https://www.pinecone.io/>
>> >    - OpenAI <https://openai.com/>
>> >    - Cohere <https://cohere.com/>
>> >
>> >
>> > Advancements in LLMs are moving at a rapid pace and transforming the way
>> > we work and our industry. Although LLMs are simple to use in prototyping,
>> > using LLMs for enterprise applications and in production still presents a
>> > lot of challenges. These
>> > <https://speakerdeck.com/kaxil/building-and-deploying-llm-applications-with-apache-airflow?slide=8>
>> > are some of the same problems that we tackle in Data Engineering, and
>> > Airflow is a natural fit for them.
>> >
>> > We at Astronomer would like to add first-class support for the popular
>> > LLMs (OpenAI & Cohere) and vector DBs (PgVector, Weaviate & Pinecone) so
>> > that Data Scientists and ML engineers can utilize them natively with
>> > easy-to-use Operator & Hook abstractions while providing a native (and
>> > Production-ready) approach for Authentication, retries, logging etc.
>> >
>> > We also think this is vital for the Apache Airflow project as we, the
>> > project, embrace the LLM tide and continue to be a great example of
>> > balancing innovation and maintaining backward-compatibility.
>> >
>> > The first versions of these providers will enable building one of the
>> > most common use cases of LLMs, i.e. Question Answering / Chatbots using
>> > Retrieval-augmented generation (RAG) done with the help of embeddings.
>> >
>> > Everyone is welcome and encouraged to contribute once the PRs are merged.
>> > Astronomer is committed to maintaining these providers in the Airflow
>> > repo, including reviewing PRs, maintaining code quality, testing and
>> > keeping the APIs up-to-date.
>> >
>> > Note: PgVector <https://github.com/pgvector/pgvector> is an open-source
>> > project, so we don’t need a formal vote for it as per our guidelines
>> > <https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers>.
>> > So please consider this email as seeking a Lazy Consensus for it.
>> >
>> > I will open up a VOTING thread after discussing this for a few days.
>> >
>> > Thanks.
>> >
>> > Regards,
>> >
>> > Kaxil
>> >
>>
>
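
For illustration only, the "super simple dump of HTML" idea discussed in the
thread could look roughly like the sketch below. It is not what MWAA or
Astronomer actually run. It assumes each run is a dict with "dag_id",
"execution_date" and "state" keys, the field names used by items in the
"dag_runs" list of the Airflow stable REST API
(GET /api/v1/dags/{dag_id}/dagRuns). The function names and the DAG ids in
the example data are hypothetical.

```python
from html import escape

# Assumption: any state other than "success" or "failed" (for example
# "queued" or "running") is shown in gray.
STATE_COLORS = {"success": "green", "failed": "red"}


def all_green(runs):
    """Return True only if there is at least one run and every run succeeded."""
    return bool(runs) and all(run["state"] == "success" for run in runs)


def render_dashboard(runs):
    """Render a minimal HTML status table, sorted by DAG id and then date."""
    rows = []
    for run in sorted(runs, key=lambda r: (r["dag_id"], r["execution_date"])):
        color = STATE_COLORS.get(run["state"], "gray")
        rows.append(
            '<tr><td>{}</td><td>{}</td><td style="color:{}">{}</td></tr>'.format(
                escape(run["dag_id"]),
                escape(run["execution_date"]),
                color,
                escape(run["state"]),
            )
        )
    summary = "all green" if all_green(runs) else "needs attention"
    return (
        "<html><body><h1>System tests: {}</h1>"
        "<table><tr><th>DAG</th><th>Date</th><th>State</th></tr>{}</table>"
        "</body></html>"
    ).format(summary, "".join(rows))


# Hypothetical example data; these DAG ids are made up for illustration.
runs = [
    {"dag_id": "example_weaviate", "execution_date": "2023-10-18T00:00:00+00:00", "state": "success"},
    {"dag_id": "example_openai", "execution_date": "2023-10-18T00:00:00+00:00", "state": "failed"},
]
page = render_dashboard(runs)
```

The resulting string could then be written to a publicly readable location
such as an S3 bucket, and the all_green() check could drive the alerting
mentioned in the thread.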
