Re: Airflow 3 Dev Call #1 with Proposed Agenda

2024-05-29 Thread Jarek Potiuk
Added my comments too :). Looks great

On Wed, May 29, 2024 at 12:14 PM Kaxil Naik  wrote:

> Thanks Shubham and Jens, I will take a look later today.
>
> On Wed, 29 May 2024 at 03:07, Mehta, Shubham 
> wrote:
>
> > Kaxil - thank you for creating the wiki, setting up call invites, and
> > starting this thread. The first draft of the principles looks great. I
> have
> > commented on the wiki with some feedback and personal thoughts. I'm not
> > adding the comments on this thread to keep the wiki as the single place
> for
> > discussions and feedback.
> >
> > Thanks
> > Shubham
> >
> > On 2024-05-28, 12:25 PM, "Kaxil Naik" <kaxiln...@apache.org> wrote:
> >
> >
> > If the formatting of bullets was lost, check below or at
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes#Airflow3Devcall:MeetingNotes-4June2024
> >
> >
> > Proposed Agenda:
> > 1) Agreeing on the Principles to drive Airflow 3 development
> > 2) Agreeing on the Guidelines that help decide if a feature should be in
> > Airflow 3 or not
> >
> >
> >
> >
> > For (1), I propose the following principles:
> >
> >
> > - Considering Airflow 3.0 for early adopters and breaking (and removing)
> > things. Things can be re-added as needed in upcoming minor releases.
> > - Optimize to get foundational pieces in and not "let perfect be the
> enemy
> > of good"
> > - Smoother migration path between AF 2 & 3 especially for DAG Authors
> with
> > the existing official Airflow providers.
> > - Working on features that solidify Airflow as the modern Orchestrator
> > that also has state-of-the-art support for Data, AI & ML workloads.
> > - This includes improving scalability & performance of all the Airflow
> > components.
> > - Making Airflow aware of what's happening in the task to provide better
> > auditability, lineage & observability
> > - Set up the codebase for the next 3-5 years.
> > - Reducing the matrix of supported combinations to reduce complexity in
> > testing & development, e.g. removing MySQL support to shrink the test
> > matrix
> > - Simplifying the codebase & standardizing the architecture (e.g.
> > consolidating serialization methods)
> > - Remove deprecations
> > - Simplify the Learning Curve for new Airflow users
> > - Shift focus on Airflow 2 to stability: bug fixes + security fixes after
> > AF 2.10. This should continue for a longer period of time after AF 3
> release
> > - Target a shorter cycle to release Airflow 3
> > - so that Airflow 2 branches for features don't diverge
> > - have enough time between Airflow 3 release and Airflow Summit 2025, so
> > we can have talks about Successful migrations
> >
> >
> > For (2), I propose the following guidelines:
> >
> >
> > - Alignment with Core Principles
> > - Community Demand and Feedback
> > - Impact on Scalability and Performance
> > - Implementation Complexity and Maintenance
> > - Backward Compatibility and Migration Effort
> > - Workstream Ownership (can be more than one). If no one is available to
> > lead the workstream, the feature will be parked until a dedicated owner
> is
> > found
> > - For big epics, AIPs & a successful vote on the dev mailing list
> >
> >
> > Please reply if you have anything to add to the agenda, or comment if
> > you disagree with anything.
> >
> >
> > Looking forward to the call.
> >
> >
> > Regards,
> > Kaxil
> >
> >
> > On 2024/05/28 19:11:14 Kaxil Naik wrote:
> > > Hi all,
> > >
> > > As discussed in the previous email thread, the first dev call has been
> > > pushed to next Tuesday (4th June 2024).
> > >
> > > If you would like to participate in the development of Airflow 3,
> please
> > > join the dev calls starting next week. The calls will be open to anyone
> > in
> > > the community.
> > >
> > > *Schedule*: June 4, 2024, Tuesday, at 05:00 PM BST (4 PM GMT/UTC | 12
> PM
> > > EST | 9 AM PST)
> > > *One-time registration Link*:
> > >
> >
> https://astronomer.zoom.us/meeting/register/tZAsde2vqDwpE9XrBAbCeIFHA_l7OLywrWkG
> > >
> > > The meeting notes from the call will also be posted on the dev mailing
> > list
> > > and Confluence for archival purposes
> > > at
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-28 Thread Jarek Potiuk
For your information - the last PR has been merged
(https://github.com/apache/airflow/pull/39862) and our CI should now run a
full set of Provider unit tests with Airflow 2.9.1, 2.8.4 and 2.7.3 - thus
giving us more confidence that provider changes do not depend on features
implemented in newer versions of Airflow.

An important part of the change: it is very easy to reproduce and run such
tests locally thanks to Breeze's versatility. For example, this:

breeze shell --use-airflow-version 2.7.3 --mount-sources providers-and-tests

should start the Breeze container and drop you into a shell where you will
be able to iterate on provider tests run against Airflow 2.7.3.

There are also instructions explaining how to run such tests in our
contributing docs:
https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases
which also explain how to deal with common cases of writing a compatible
test case - whenever compatibility tests fail for you, you will be directed
to those instructions.


J.


On Wed, May 15, 2024 at 10:23 PM Jarek Potiuk  wrote:

> The 2.9 compatibility tests are now merged! The 2.8 PR is almost ready as
> a follow-up, and the next thing will be 2.7.
>
> A small thing for everyone to look at now is to make sure the tests are
> also passing on 2.9 (and later 2.8 and 2.7) - but this should just be a
> job failing in your PRs if they aren't. I also added documentation
> https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#running-provider-compatibility-tests
> explaining how to handle common cases (based on the changes merged for 2.9
> compatibility) and how you can easily reproduce such compatibility tests
> locally.
>
> When compatibility tests fail, the error message will explain it and link
> to that documentation. I hope this will help to keep it going seamlessly
> in the future. If you have any problems - ping me on Slack and I will try
> to help (but until Tuesday I am at PyCon, so don't expect my usual
> availability).
>
> J.
>
> On Mon, May 13, 2024 at 11:02 AM Jarek Potiuk  wrote:
>
>> OK. Tests should be green now - all the issues are "handled" - there are
>> a few follow-up tasks from the test run on 2.9.1, but the PR should be
>> quite ready for final review and merge now, and I can attempt to look at
>> 2.8 compatibility once it's done.
>>
>> On Mon, May 13, 2024 at 1:17 AM Jarek Potiuk  wrote:
>>
>>> OK. I think I found and fixed all the compatibility issues in
>>> https://github.com/apache/airflow/pull/39513 - except one last
>>> openlineage plugin enablement fix (but I think reviews for all the other
>>> changes would be great until we fix the issue). There are probably a few
>>> incompatibilities that will need to be addressed before we release 2.1.10
>>> so I need confirmation / comments from Niko/Daniel if my findings are
>>> correct.
>>>
>>> On Fri, May 10, 2024 at 12:54 PM Jarek Potiuk  wrote:
>>>
>>>> > Just for clarification, this is only related to the provider's tests,
>>>> right?
>>>>
>>>> Absolutely.
>>>>
>>>> On Fri, May 10, 2024 at 11:21 AM Andrey Anshin <
>>>> andrey.ans...@taragol.is> wrote:
>>>>
>>>>> > "enable" tests for 2.8 and 2.7 separately
>>>>>
>>>>> Just for clarification, this is only related to the provider's tests,
>>>>> right?
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 10 May 2024 at 13:15, Jarek Potiuk  wrote:
>>>>>
>>>>> > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>>>>> move
>>>>> > forward to the initial version of Airflow 3 providers might already
>>>>> drop
>>>>> > support of these versions in providers.
>>>>> > Airflow 2.7 in the mid of August 2024
>>>>> > Airflow 2.8 in the mid of December 2024
>>>>> >
>>>>> > Yep. But also "here and now" those compatibility tests might help us
>>>>> to
>>>>> > find some hidden incompatibilities (and prevent adding future ones).
>>>>> We can
>>>>> > see how much complexity we are dealing with when we attempt to
>>>>> enable the
>>>>> > tests for 2.8 and then 2.7 and decide if it's worth it. The change I
>>>>> added
>>>>> > makes it easy to just "enable" tests for 2.8 a

Re: [VOTE] May 2024 PR of the Month

2024-05-28 Thread Jarek Potiuk
#39336 hands down

On Tue, May 28, 2024 at 7:02 PM Briana Okyere
 wrote:

> Hey All,
>
> It’s once again time to vote for the PR of the Month!
>
> With the help of the `get_important_pr_candidates` script in dev/stats,
> we've identified the following candidates:
>
> PR #39510: Add Scarf based telemetry <
> https://github.com/apache/airflow/pull/39510>
>
> PR #39513: Run unit tests with airflow installed from packages <
> https://github.com/apache/airflow/pull/39513>
>
> PR #39365: Fix the pinecone system test <
> https://github.com/apache/airflow/pull/39365>
>
> PR #39336: Scheduler to handle incrementing of try_number
> 
>
> PR #39650: Add metrics about task CPU and memory usage <
> https://github.com/apache/airflow/pull/39650>
>
> Please reply to this thread with your selection or offer your own
> nominee(s).
>
> Voting will close on Friday May 31st at 9 AM PST. The winner(s) will be
> featured in the next issue of the Airflow newsletter.
>
> Also, if there’s an article or event that you think should be included in
> this or a future issue of the newsletter, please drop me a line at <
> briana.oky...@astronomer.io>
>
> --
> Briana Okyere
> Community Manager
> Astronomer
>


Re: Airflow 3 Dev Calls: Registration + Schedule

2024-05-27 Thread Jarek Potiuk
Just a note - the first meeting is on the 30th of May, which is a holiday
in quite a number of countries (Corpus Christi)
https://www.officeholidays.com/holidays/corpus-christi - and taking into
account that it is always a Thursday, people usually bridge it with the
weekend :). I personally will not be available.

Maybe it's worth moving the first call to next Thursday (the 6th) instead?

Personally, that would also be after the Community Over Code conference
that happens Mon-Wed in Bratislava (I am leading the data engineering
track there and giving a talk about Airflow too).

J


On Sat, May 25, 2024 at 1:36 AM Kaxil Naik  wrote:

> Hi all,
>
> If you would like to participate in the development of Airflow 3, please
> join the dev calls starting next week. The calls will be open to anyone in
> the community.
>
> *Schedule*: Starting May 30, 2024, at 04:00 PM BST (3 PM GMT/UTC | 11 AM
> EST | 8 AM PST), it will be weekly until we agree on the principles and
> fortnightly after that.
> *One-time registration Link*:
>
> https://astronomer.zoom.us/meeting/register/tZAsde2vqDwpE9XrBAbCeIFHA_l7OLywrWkG
> *Add to your calendar*:
>
> https://astronomer.zoom.us/meeting/tZAsde2vqDwpE9XrBAbCeIFHA_l7OLywrWkG/calendar/google/add
>
> The meeting notes from the call will also be posted on the dev mailing list
> and Confluence for archival purposes (for example
> ). At
> the end of each call, I would solicit ideas for the agenda for the next
> call and propose it to the broader group on the mailing list.
>
> Some of the items that should be discussed in the upcoming calls IMO:
>
>- Agreeing on Principles
>- Agreeing on the Guidelines that help decide if a feature should be in
>Airflow 3 or not.
>- Workstream & Stream Owners
>- Airflow 2 support policy
>- Separate discussions for each big workstream including one for items
>to remove & refactor (e.g dropping MySQL)
>- Discussion to streamline the development of Airflow 3
>- Finalize Scope + Timelines
>- Migration Utilities
>- Progress check-ins
>
> I will send a separate email for the first dev call in the next few days.
>
> Regards,
> Kaxil
>


Re: [DISCUSS] AIP-71 Generalizing DAG Loader and Processor for Ephemeral Storage

2024-05-27 Thread Jarek Potiuk
> It's a long long read @Jarek Potiuk  ...

You did invite people to comment :). Those were all the areas that came to
my mind :)

> In summary, moving to pathlib.Path enhances flexibility and makes future
integrations, like AIP-63, more manageable while not forcing any immediate
changes to current configurations or codebases.

Yep. I like the Versioned FS abstraction on top of fsspec - even if it is
not directly supported by fsspec and we have to build it ourselves. I will
leave it to the AIP-63 authors to comment on that - if they see that it
benefits them, then yes, with that explanation I am a little less torn on
it and more in favour of it. Also the "pull" model, where the task only
reads the files it needs, might be interesting (but it would be great to
verify that this is really the case - it is mostly useful as a git-sync
replacement, as other solutions based on shared volumes already basically
do this under the hood). The Versioned FS layer will have to be rather
carefully designed though, if AIP-63 builds on it.

The good thing is that fsspec is generally very well adopted and used by
pretty much every other player in the data space, so chances are a lot of
the issues with syncing are solved there already and will continue to be
solved.

J.




On Mon, May 27, 2024 at 12:17 PM Bolke de Bruin  wrote:

> It's a long long read @Jarek Potiuk  ...
>
> In general, you are right. The intention is to eliminate the need for an
> external sync, which is beneficial for Kubernetes (k8s) workers. They would
> only need to load one DAG and its resources, rather than syncing the entire
> DAG folder. This can be done selectively if the task doesn't require all
> the resources in the DAG.
>
> Moreover, moving to pathlib.Path provides flexibility without any
> downsides - there is no forced upgrade required. It does not necessitate
> getting rid of external sync mechanisms. You can still use external sync,
> and it will function as it does now. We can transition to using
> pathlib.Path without the "ephemeral storage" component. Adopting
> pathlib.Path would also improve the fundamentals to enable DAG loading on
> M.S. Windows and offer several optimizations, such as reducing os.stat
> usage and enabling early loop exits for filters.
>
> Using pathlib.Path with local storage does not require changes to existing
> DAG code. If you move to the ephemeral model, it would be advised to use
> the pathlib.Path abstraction in your DAGs, which is the preferred method
> for file access in Python. However, there is no forced upgrade path; you
> can choose if and when to adopt ephemeral storage. Having DAG processing
> work with pathlib.Path will make implementing AIP-63 much easier.
>
> Integration of AIP-63 - Layered Architecture
>
> ---
> | DAG Processing   |
> ---
> | pathlib.Path   |
> ---
> | UPath / ObjStPath |
> ---
> | VersionedFS*|
> ---
> | S3/Git/GCS/Local |
> ---
> | FSSpec |
> ---
>
> This architecture suggests extending or using FSSpec to define a
> VersionedFS, similar to ZipFS, by leveraging the versioning capabilities of
> underlying storage systems like Git, S3, etc. Using pathlib.Path as an
> abstraction is essential to make this possible. Relying on "os" would
> require building all versioning logic into the DAG processing or
> externalizing it to a syncer. This architecture is simpler and easier to
> maintain due to lighter coupling.
>
> Strategies for Loading Additional Resources with Ephemeral Storage:
>
> 1. Resource loading through a custom Resource Loader (preferred, best
> practice)
> 2. Zip file containing all resources for a DAG
> 3. Using pathlib.Path / ObjectStoragePath directly
> 4. Downloading resources by the custom module loader by convention
> (e.g., package_name/resources/)
> 5. Reading a manifest by the custom module loader and loading from
> there
>
> If using local storage with pathlib.Path, you can rely on the old
> strategies like direct loading.
>
> My previous reply to Elad seems to have been lost due to Apache's
> mailer-daemon, but GitHub (remote) and GitFS (local) are already
> supported, offering extensive versioning capabilities. Other storage
> solutions like GCS, S3, and ADLS have limited versioning support - as you
> mentioned - and may require bundling methods such as sparse bundles or
> zip files. Having this abstracted away in VersionedFS gives us much
> flexibility.
>
> In summary, moving to pathlib.Path enhances flexibility and makes future
> integrations, like AIP-63, more manageable while not forcing any immedia

Re: [VOTE] Airflow Providers prepared on May 26, 2024

2024-05-26 Thread Jarek Potiuk
+1 (binding) - checked reproducibility, checksums, licences, signatures.
Checked my changes.

On Sun, May 26, 2024 at 11:03 AM Elad Kalif  wrote:

> Hey all,
>
> I have just cut the new wave Airflow Providers packages. This email is
> calling a vote on the release, which will last for 72 hours - which means
> that it will end on May 29, 2024 09:00 AM UTC and until 3 binding +1 votes
> have been received.
>
> Consider this my (binding) +1.
>
> Airflow Providers are available at:
> https://dist.apache.org/repos/dist/dev/airflow/providers/
>
> *apache-airflow-providers--*.tar.gz* are the binary
>  Python "sdist" release - they are also official "sources" for the provider
> packages.
>
> *apache_airflow_providers_-*.whl are the binary
>  Python "wheel" release.
>
> The test procedure for PMC members is described in
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
>
> The test procedure for and Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
>
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> Please note that the version number excludes the 'rcX' string.
> This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> The status of testing the providers by the community is kept here:
> https://github.com/apache/airflow/issues/39842
>
> The issue is also the easiest way to see important PRs included in the RC
> candidates.
> Detailed changelog for the providers will be published in the documentation
> after the
> RC candidates are released.
>
> You can find the RC packages in PyPI following these links:
>
> https://pypi.org/project/apache-airflow-providers-airbyte/3.8.1rc1/
> https://pypi.org/project/apache-airflow-providers-alibaba/2.8.1rc1/
> https://pypi.org/project/apache-airflow-providers-amazon/8.23.0rc1/
> https://pypi.org/project/apache-airflow-providers-apache-beam/5.7.1rc1/
>
> https://pypi.org/project/apache-airflow-providers-apache-cassandra/3.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-drill/2.7.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-druid/3.10.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-flink/1.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-hdfs/4.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-hive/8.1.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-impala/1.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-kafka/1.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-kylin/3.6.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-livy/3.8.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-pig/4.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-pinot/4.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-apache-spark/4.8.1rc1/
> https://pypi.org/project/apache-airflow-providers-apprise/1.3.1rc1/
> https://pypi.org/project/apache-airflow-providers-arangodb/2.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-asana/2.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-atlassian-jira/2.6.1rc1/
> https://pypi.org/project/apache-airflow-providers-celery/3.7.1rc1/
> https://pypi.org/project/apache-airflow-providers-cloudant/3.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/8.3.0rc1/
> https://pypi.org/project/apache-airflow-providers-cohere/1.2.1rc1/
> https://pypi.org/project/apache-airflow-providers-common-io/1.3.2rc1/
> https://pypi.org/project/apache-airflow-providers-common-sql/1.14.0rc1/
> https://pypi.org/project/apache-airflow-providers-databricks/6.5.0rc1/
> https://pypi.org/project/apache-airflow-providers-datadog/3.6.1rc1/
> https://pypi.org/project/apache-airflow-providers-dbt-cloud/3.8.1rc1/
> https://pypi.org/project/apache-airflow-providers-dingding/3.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-discord/3.7.1rc1/
> https://pypi.org/project/apache-airflow-providers-docker/3.12.0rc1/
> https://pypi.org/project/apache-airflow-providers-elasticsearch/5.4.1rc1/
> https://pypi.org/project/apache-airflow-providers-exasol/4.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-fab/1.1.1rc1/
> https://pypi.org/project/apache-airflow-providers-facebook/3.5.1rc1/
> https://pypi.org/project/apache-airflow-providers-ftp/3.9.1rc1/
> https://pypi.org/project/apache-airflow-providers-github/2.6.1rc1/
> https://pypi.org/project/apache-airflow-providers-google/10.19.0rc1/
> 

Re: [DISCUSS] AIP-71 Generalizing DAG Loader and Processor for Ephemeral Storage

2024-05-26 Thread Jarek Potiuk
It is an interesting read indeed. And it might be a good direction. But I
agree with Ash that just Python loading is not quite enough (though it
might be somewhat solvable, in a possibly breaking way, with a
ResourceLoader).

> A sync component will still be required to sync from git to S3/GCS/Other
storage
and this AIP solves only the part that Airflow machines will be able to
fetch the files from storage. Is that correct?

Looking at Elad's comment - first - let me rephrase how I understand the
proposal :).

As I understand it (Bolke to confirm) - this change would allow us to get
rid of the sync component (or shared volumes) altogether. In the case you
describe, Elad - there is no need to sync Git to S3/GCS/Other -
GitFileSystem could be used directly (very similarly to deployments where
git-sync is currently used).

If I understand the idea - we would have to have a custom source loader
(https://docs.python.org/3/library/importlib.html#importlib.abc.SourceLoader)
that would allow the DAGFileProcessor to import the files using one of the
selected fsspec backends. Also, to what Ash wrote - we could implement a
custom resource loader
(https://docs.python.org/3/library/importlib.html#importlib.abc.ResourceLoader)
which might partially solve the problem of sql/yaml files. But it might
require some fixes in our code for templated fields/files, and custom code
will likely break if the files are read directly from the filesystem rather
than via the Python resource API - I think most people do not use the
resource API to read files; they fall back to direct file reading by
constructing the path, so this will be rather heavily breaking for those
cases (unfortunately people traditionally use `__file__` to find and read
files, and `__file__` is a string, not a `Path`).

This means that the synchronization component in the deployment is
basically not needed at all. Currently the responsibility to sync the
files is delegated out to the deployment (git-sync, s3fs, shared
filesystem etc.). If we switch to fsspec, Airflow (the "DAG file
processor", basically) will take over the responsibility (and the
processing/IO/etc. needed) to synchronize and load the files from such
remote storage (via the fsspec library of choice). Currently Airflow
always reads files from a local filesystem without worrying about how they
have been synced - it assumes someone else synced the files.

This is all done in user space (so we cannot utilize any kernel-space
optimizations that shared volumes provide). It also means that credentials
to the remote file system must be stored in Airflow Connections (currently
the whole syncing process, including authentication, is done externally).
And we can fall back to the current behaviour with the Local File System
fsspec implementation (and keep shared volumes/git-sync as something
people could continue to use) - so this only "adds" features to Airflow
syncing rather than replacing it (possibly the Local Filesystem fsspec
will be backwards-compatible).

So, for example, when GitFS is used and the SourceLoader is invoked, it
will transparently pull the right version from git. It would be
interesting to see the maturity and characteristics of such a solution,
and its cache effectiveness.

Do I understand it correctly, Bolke?

BTW, I would not be surprised if someone has already implemented such
loaders - this seems like a generic need people might have.

Comments:

In the current solution we already use caching coming from the OS -
basically the files are stored in a local filesystem and cached by any of
the current syncing solutions. None of them will really go all the way and
download all the files if they have not changed - at most they will check
whether they changed. A POSIX OS will cache the files - they are
memory-mapped, and reading the same file multiple times is really reading
it from memory. So caching as an advantage possibly does not give us much.
Note that fsspec supports caching files locally
(https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally)
-> since we are going to read whole Python files, not parts of files,
essentially "filecache" needs to be used rather than a block cache, so we
effectively get the same effect as the current "git-sync", "s3fs",
"gcs-fuse" or "shared filesystem" approaches - just incorporated into the
particular fsspec library embedded in our code, rather than being done by
an external tool. All of them have various levels of local file caching.
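As a rough illustration of what "filecache"-style behaviour means here (a generic sketch of the idea, not fsspec's actual implementation - class and field names are made up): the whole file is fetched from the remote once, and subsequent reads are served from the local copy as long as the remote version marker has not changed.

```python
import shutil
import tempfile
from pathlib import Path


class FileCache:
    """Naive whole-file cache: re-fetch only when the remote 'version' changes.

    fsspec's real "filecache" tracks per-backend checksums/mtimes; this is a
    simplified stand-in to show the behaviour.
    """

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.versions = {}       # file name -> last seen remote version
        self.fetch_count = 0     # how many real downloads happened

    def open_cached(self, remote_path: Path, remote_version: str):
        local = self.cache_dir / remote_path.name
        if self.versions.get(remote_path.name) != remote_version:
            # Cache miss or stale copy: download the whole file.
            self.fetch_count += 1
            shutil.copy(remote_path, local)
            self.versions[remote_path.name] = remote_version
        return local.open("rb")


# Demo with two temp dirs standing in for "remote storage" and the cache.
remote = Path(tempfile.mkdtemp())
(remote / "my_dag.py").write_bytes(b"DAG_ID = 'demo'\n")

cache = FileCache(Path(tempfile.mkdtemp()))
with cache.open_cached(remote / "my_dag.py", "v1") as f:
    f.read()
with cache.open_cached(remote / "my_dag.py", "v1") as f:
    f.read()  # served from the local copy, no second fetch

print(cache.fetch_count)  # -> 1
```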

There might be other performance benefits of using fsspec though - for
example it might speed up the git-sync case. If I understand correctly,
GitFS will only fetch individual files when asked for them, so for example
task execution might be faster, as it will only pull the one or few files
needed when running on a not "warmed up" instance - for example with the
K8S executor. But it needs to be verified whether that really brings
improvements.

The versioning option is indeed interesting. For AIP-63 we will have to be

Re: [DISCUSS] Proposal to enhance Backfills

2024-05-26 Thread Jarek Potiuk
Yes - long awaited. Indeed, some implementation details would be needed to
get it to an AIP. And I also think there is one important decision to
consider - should it target Airflow 2?
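To make the "backfill request" endpoint idea concrete, a submission could look roughly like the sketch below. The field names and values are purely illustrative assumptions on my part - nothing here is part of any agreed API.

```python
import json
from datetime import date

# Hypothetical payload for the proposed "backfill request" REST endpoint;
# every field name here is an illustrative assumption, not a decided API.
backfill_request = {
    "dag_id": "daily_model_training",
    "start_date": date(2024, 1, 1).isoformat(),
    "end_date": date(2024, 3, 31).isoformat(),
    "max_active_runs": 4,                   # normal concurrency settings apply
    "reprocess_behavior": "rerun_failed",   # e.g. rerun all vs. only failed runs
}

# The scheduler would treat the resulting DagRuns like any other DagRuns,
# and only Airflow Workers would execute the tasks.
body = json.dumps(backfill_request)
print(body)
```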

On Sun, May 26, 2024 at 12:26 PM Elad Kalif  wrote:

> > In order for this to become a reality, Backfills need to be handled by
> the
> Airflow Scheduler as a normal DAG execution
>
> I think it's a good idea.
> It should solve natively problems like
> https://github.com/apache/airflow/issues/11302
>
> On Fri, May 24, 2024 at 10:58 PM Vikram Koka  >
> wrote:
>
> > Fellow Airflowers,
> >
> > I am following up on some of the proposed changes in the Airflow 3
> proposal
> > <
> >
> https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/
> > >,
> > where more information was requested by the community.
> >
> > One specific topic was "Running Backfills at scale". This is not yet a
> > full-fledged AIP, but a starting point for the discussion leading
> > towards an AIP with fully defined technical details.
> > Backfills at scale
> >
> > Backfills in Airflow 2.x are treated as an exception and executed by an
> > incarnation of the BackfillJob, rather than the regular Airflow Scheduler
> > itself. This results in unexpected interactions with the other DAGs being
> > run by the main Airflow Scheduler at the same time including resource
> > contention and possibly unexpected delays because established scalability
> > configuration settings such as Concurrency are not consistently applied,
> > and also code-level complexity by having two somewhat-similar
> > implementations of scheduling logic.
> >
> >
> > However, with ML model training, backfills are a common operation and
> need
> > to be treated as a regular Airflow DAG / Task execution operation and not
> > treated as an exception. It is also not possible to run a backfill unless
> > you have direct access to the Airflow database/SSH access to the Airflow
> > server, which is not possible for many/most data engineers.
> >
> >
> > In order for this to become a reality, Backfills need to be handled by
> the
> > Airflow Scheduler as a normal DAG execution, building on the Dynamic Task
> > Mapping execution pattern, rather than an exception. Additionally,
> Backfill
> > tasks will now ONLY be executed by the Airflow Workers, for obvious
> reasons
> > including scalability. A less obvious, but important reason is Security,
> > since it is ideal to have data connections to Enterprise data only happen
> > through Airflow Workers, rather than any Airflow system components.
> >
> >
> > As part of making Backfill support cleaner in Airflow, Backfill DAG
> > execution will also be supported in the Airflow REST API.
> >
> >
> > This proposal is purposefully light on exact implementation details but
> > will include at least:
> >
> >
> >
> >-
> >
> >Making the Airflow Scheduler responsible for scheduling decisions on
> all
> >DagRuns (instead of the current where it purposefully ignores backfill
> > runs)
> >-
> >
> >A new API endpoint to submit a "backfill request".
> >
> >
> > --
> >
> >
> > Best regards,
> > Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
> >
>


Re: [VOTE] AIP-69 Remote Executor

2024-05-18 Thread Jarek Potiuk
PR to review before
> accepting the AIP? Are you looking for certain patterns, modules which are
> used?
> - @jarek: PoC can be made but I fear that is already 50% of effort to a
> MVP. That is why I was seeking for feedback if this would be the right way
> before spending efforts.
> - @jarek: AIP-44 would certainly be leveraged. I have no plan to replicate
> existing API and I know that it will be a challenge to de-couple the task
> execution. Do you expect me to evaluate all details before a vote? I would
> have planned a pragmatic approach, once some dependency missing I would
> maybe have contributed to AIP-44 efforts to close gaps. But AIP-44 does not
> make a remote scheduling/execution possible as it does not include the
> remote worker and executor component. That is to be added in AIP-69.
> - @jarek/@ash: You briefly talked about "task isolation", but are there
> any concepts you can share that would help? I understood you are _thinking_
> about it and there will be papers _soon_? If there is anything that would
> contribute, please share. I have no crystal ball.
> @jarek: Lifecycle of tasks, adoption, etc.: Yes, this is something to be
> addressed and is a must to include. Mainly it would be built around
> heartbeat logic. I am sure there will be work to do, and resiliency will be
> covered. If you have requirements or ideas, please contribute. Do you see
> this as a must-have in the AIP before going to a vote, or is it just
> important that it is covered?
> @jarek: API promises on exactly-once: My plan is to use and rely on DB
> locks and transactions from the API. Assuming that something can also go
> wrong between the API and the remote side, I would add an additional
> confirmation in the heartbeat once the task is accepted and starting on the
> remote (to cover the case where a task is assigned and the DB transaction
> committed, but the response never reaches the remote because, say, the line
> breaks down). Also, certainly, if a remote worker is "lost" and does not
> heartbeat for a (to-be-configured) timeout, then the task must be
> re-assigned or assumed to have failed.
>
> P.S.: I'd like to take your feedback seriously, but the AIP process
> description in Confluence just says: "Once you or someone else feels like
> there’s a rough consensus on the idea and there’s no strong opposition, you
> can move your proposal to the Vote phase." Neither this nor the structure
> template mentions that a technical spec or PR must be provided prior to a
> vote. If you feel that an AIP should include these, then I assume the
> contribution docs need to be adjusted.
>
> Mit freundlichen Grüßen / Best regards
>
> Jens Scheffler
>
> Alliance: Enabler - Tech Lead (XC-AS/EAE-ADA-T)
> Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> GERMANY | www.bosch.com
> Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> jens.scheff...@de.bosch.com
>
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> Aufsichtsratsvorsitzender: Prof. Dr. Stefan Asenkerschbaumer;
> Geschäftsführung: Dr. Stefan Hartung, Dr. Christian Fischer, Dr. Markus
> Forschner,
> Stefan Grosch, Dr. Markus Heyn, Dr. Frank Meyer, Dr. Tanja Rückert
>
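The heartbeat-timeout rule described in the message above (a worker that stops heartbeating past a configured timeout has its tasks re-assigned or failed) can be sketched minimally. This is not Airflow code; the names, the timeout value, and the data shape are assumptions for illustration:

```python
import time

# Minimal sketch of the timeout rule: a remote worker that has not
# heartbeated within the configured window has its tasks re-queued
# (or marked failed).
HEARTBEAT_TIMEOUT = 60.0  # seconds; would be configurable in practice

def stale_tasks(last_heartbeats, now=None):
    """Return task ids whose last worker heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return sorted(
        task_id
        for task_id, beat in last_heartbeats.items()
        if now - beat > HEARTBEAT_TIMEOUT
    )

# task_a last heartbeated 100s ago (stale); task_b 45s ago (still fine):
print(stale_tasks({"task_a": 1000.0, "task_b": 1055.0}, now=1100.0))  # -> ['task_a']
```

A scheduler loop would call something like this periodically and re-queue whatever it returns, which is exactly the "re-assigned or assumed to be failed" branch of the described protocol.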
> -Original Message-
> From: Vikram Koka 
> Sent: Saturday, May 18, 2024 6:57 PM
> To: dev@airflow.apache.org
> Subject: Re: [VOTE] AIP-69 Remote Executor
>
> I agree with Jarek and Ash on this.
>
> I believe that the AIP as written documents the “what” and the “why” very
> well, but is too light on the “how”.
>
> I would very much like to see this AIP become reality as well, but I
> believe that we need some foundational elements, such as AIP-44 and the
> "task context" concept, to take this AIP to fruition with enough
> functionality to be meaningful.
>
> It’s entirely possible that you are proposing something in between that’s
> feasible in the short term, but that’s not clear to me yet.
>
> Vikram
>
>
> On Sat, May 18, 2024 at 8:46 AM Jarek Potiuk  wrote:
>
> > Agree with Ash. I also have a feeling that this is a very generic
> > description and some POC describing the approach we would like to take
> > here is needed to be able to vote on it. Feels a bit too early to vote
> on it.
> >
> > A lot of the internals of the current (Airflow 2) way of handling
> > tasks running in an executor are about database communication, and if
> > you look closely, decoupling those internals to make them work with
> > the current executor API might be very difficult or complex if we
> > stick to the current task <> airflow communication patterns. In some
> > ways, you already get a "Remote Executor" simply by completing AIP-44
> > (which precisely provides an HTTP-based

Re: [VOTE] AIP-69 Remote Executor

2024-05-18 Thread Jarek Potiuk
Agree with Ash. I also have a feeling that this is a very generic
description, and some POC describing the approach we would like to take here
is needed to be able to vote on it. It feels a bit too early to vote.

A lot of the internals of the current (Airflow 2) way of handling tasks
running in an executor are about database communication, and if you look
closely, decoupling those internals to make them work with the current
executor API might be very difficult or complex if we stick to the current
task <> airflow communication patterns. In some ways, you already get a
"Remote Executor" simply by completing AIP-44, which precisely provides an
HTTP-based interface between tasks and the scheduler in a very similar
fashion to the AIP-69 proposal, but without changing the communication
patterns.
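The contrast between the two communication patterns can be sketched as follows. None of this is actual Airflow or AIP-44 code; the function names, SQL, and endpoint path are illustrative assumptions only:

```python
# Illustrative contrast between the two patterns: in Airflow 2, task code
# updates state by talking to the metadata DB directly; with an
# AIP-44-style internal API, the same operation becomes an HTTP call, so
# workers need no DB credentials. All names and paths here are made up.

def set_state_via_db(execute, task_id, state):
    # Airflow 2 style: the worker holds a DB connection/session.
    execute(
        "UPDATE task_instance SET state = :state WHERE task_id = :tid",
        {"state": state, "tid": task_id},
    )

def set_state_via_api(post, task_id, state):
    # AIP-44 style: the worker only needs to reach an internal HTTP endpoint.
    post(f"/internal_api/v1/task_instances/{task_id}/state", {"state": state})

# The call sites are interchangeable, which is what makes a "backwards
# compatible" migration path plausible:
calls = []
set_state_via_api(lambda path, body: calls.append((path, body)), "t1", "success")
print(calls)  # -> [('/internal_api/v1/task_instances/t1/state', {'state': 'success'})]
```

Because the call shape is the same in both cases, swapping the transport underneath (DB session vs. HTTP client) leaves the surrounding task code untouched, which is the "limited change" property discussed here.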

As I see it, AIP-69 generally looks like either the same thing as AIP-44 or
the same thing as the future "Airflow 3" task isolation we've been
discussing and Ash is working on. Both AIP-44 and the future Task Isolation
aim to solve pretty much the same problem, but AIP-44 does so in a very
"backwards compatible" way, as a limited change that achieves "remote
execution" without creating a lot of ripple effects on all the other
components.

So I would really love to understand whether AIP-69 is really something
in between, and how - but the lightness of the description makes it
difficult to understand how AIP-69 differs from those two (of course, we
should see the future task isolation AIP as well to understand it better and
know what kind of backwards-compatibility changes it will involve in
Airflow 3).

I think at the very minimum we should see a proposal for what the API
between the task and the executor will look like (including the whole
life-cycle of tasks), how we are going to implement all the complexity
involved with task adoption and the edge cases of scheduling, which
execution-semantics promises the API will hold (exactly-once, at-most-once,
at-least-once) - something that comes as a given with Celery - and the
queuing mechanism and technologies used (how we handle the distributed case
where you have to manage multiple running workers, how the tasks will be
distributed, etc.).
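The task life-cycle part of that minimum could, for example, be pinned down as an explicit state machine. The states and transitions below are illustrative assumptions, not the actual Airflow task-instance model:

```python
from enum import Enum

# An explicit state machine makes life-cycle questions (adoption, lost
# workers, retries) concrete and reviewable. Illustrative only.
class TaskState(Enum):
    QUEUED = "queued"
    ASSIGNED = "assigned"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"

ALLOWED = {
    TaskState.QUEUED: {TaskState.ASSIGNED},
    # back to QUEUED covers adoption / a worker lost before the task starts
    TaskState.ASSIGNED: {TaskState.RUNNING, TaskState.QUEUED},
    TaskState.RUNNING: {TaskState.SUCCESS, TaskState.FAILED},
    TaskState.SUCCESS: set(),
    TaskState.FAILED: {TaskState.QUEUED},  # retry
}

def can_transition(src: TaskState, dst: TaskState) -> bool:
    return dst in ALLOWED[src]

print(can_transition(TaskState.ASSIGNED, TaskState.RUNNING))  # -> True
print(can_transition(TaskState.QUEUED, TaskState.SUCCESS))    # -> False
```

Writing the transitions down this way also forces the delivery-semantics question into the open: each edge has to say who may trigger it and what happens when two actors race on it.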

For me, the AIP currently mostly documents a "wishlist", but the
implementation details - which of those wishes we implement, which we do
not, and how - are very much absent.

J.

On Sat, May 18, 2024 at 10:41 AM Ash Berlin-Taylor  wrote:

> Can we have a link to the PR, please? The AIP doc itself is still light on
> what changes are actually needed.
>
> On 18 May 2024 14:56:57 BST, Aritra Basu  wrote:
> >+1 (non-binding)
>The proposal was a good read; I would love to see it come to life, and
>would love to help out if you need a helping hand.
> >
> >--
> >Regards,
> >Aritra Basu
> >
> >On Sat, May 18, 2024, 7:15 PM Christian Schilling
> > wrote:
> >
> >> Hi Jens,
> >>
> >> Thank you very much for the proposal!
> >> This would be cool to have such a feature in Airflow.
> >>
> >> +1 non-binding
> >>
> >> Best,
> >>
> >> Chris
> >>
> >> Scheffler Jens (XC-AS/EAE-ADA-T) 
> >> wrote on Sat., 18 May 2024, 15:40:
> >>
> >> > Hi all,
> >> >
> >> >
> >> >
> >> > Discussion thread:
> >> >
> >> > https://lists.apache.org/thread/8hlltm9brdxqyf8jyw1syrfb11hl52k5
> >> >
> >> >
> >> >
> >> > I would like to officially call for a vote to add a Remote Executor
> >> > feature to Airflow – via a provider package. All details are
> documented
> >> in:
> >> >
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-69+Remote+Executor
> >> >
> >> >
> >> >
> >> > Since the first draft was raised and the discussion thread progressed,
> >> > I integrated a bit of feedback and have now also added some technical
> >> > details as the proposed development. After a successful vote, I’d raise
> >> > a draft PR as a PoC.
> >> > As a big wave of emails was posted by Jarek after I dropped this AIP,
> >> > I’d like to highlight that I propose to make this a tactical
> >> > implementation, which might be a base for some discussion of how to
> >> > distribute work in a future Airflow 3.0. I would assume that if
> >> > interfaces and structures change, rework will be needed, and it is
> >> > fully accepted that breaking changes will need to be applied when
> >> > moving to Airflow 3.
> >> >
> >> >
> >> >
> >> > Looking forward to releasing this for Airflow 2.10 (but it depends on
> >> > how fast I can make it, and also on whether somebody wants to join
> >> > forces).
> >> >
> >> >
> >> >
> >> > Consider this my +1 binding vote.
> >> >
> >> >
> >> >
> >> > The vote will last until 6:00 PM GMT/UTC on May 23, 2024, and until at
> >> > least 3 binding votes have been cast. I have made it a bit longer than
> >> > usual because of a public holiday in some countries.
> >> >
> >> >
> >> >
> >> > Please vote accordingly:
> >> >
> >> >
> >> >
> >> > [ ] + 1 approve
> >> >
> >> > [ ] + 0 no opinion
> >> >
> >> > [ ] - 1 disapprove with the reason
> >> >
> >> >
> >> >
> >> > Only votes from PMC members and committers are binding, but other
> members

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-15 Thread Jarek Potiuk
The 2.9 compatibility tests are now merged! The 2.8 PR is almost ready as a
follow-up, and the next step will be 2.7.

A small thing for everyone to look at now is making sure the tests also pass
on 2.9 (and later 2.8 and 2.7) - but if they don't, this should show up
simply as a failing job in your PRs. I also added documentation at
https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#running-provider-compatibility-tests
explaining how to handle common cases (based on the changes merged for 2.9
compatibility) and how you can easily reproduce such compatibility tests
locally.

When compatibility tests fail, the error message will explain the failure
and link to that documentation. I hope this will help the effort continue
seamlessly in the future. If you have any problems, ping me on Slack and I
will try to help (but until Tuesday I am at PyCon, so don't expect my usual
availability).
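Provider-compatibility fixes of this kind typically gate tests or code paths on the installed Airflow version. A minimal stdlib-only sketch of that pattern (the helper names are made up; the airflow repo has its own utilities for this):

```python
# Sketch of version-gating: skip a test, or take a fallback code path,
# when the installed Airflow is older than the feature under test.
def parse_version(v: str) -> tuple:
    """Parse 'major.minor.patch' into a comparable tuple (illustrative)."""
    return tuple(int(part) for part in v.split(".")[:3])

def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed Airflow meets the minimum version."""
    return parse_version(installed) >= parse_version(minimum)

print(meets_minimum("2.9.1", "2.9.0"))  # -> True
print(meets_minimum("2.8.4", "2.9.0"))  # -> False
```

In a real provider test suite this check would typically feed a `pytest.mark.skipif` marker, so the suite stays green against every supported Airflow version.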

J.

On Mon, May 13, 2024 at 11:02 AM Jarek Potiuk  wrote:

> OK. Tests should be green now - all the issues are "handled" - there are a
> few follow-up tasks from the test run on 2.9.1, but the PR should be quite
> ready for final review and merge now, and I can attempt to look at 2.8
> compatibility once it's done.
>
> On Mon, May 13, 2024 at 1:17 AM Jarek Potiuk  wrote:
>
>> OK. I think I found and fixed all the compatibility issues in
>> https://github.com/apache/airflow/pull/39513 - except one last
>> openlineage plugin enablement fix (but I think reviews for all the other
>> changes would be great until we fix the issue). There are probably a few
>> incompatibilities that will need to be addressed before we release 2.1.10
>> so I need confirmation / comments from Niko/Daniel if my findings are
>> correct.
>>
>> On Fri, May 10, 2024 at 12:54 PM Jarek Potiuk  wrote:
>>
>>> > Just for clarification, this is only related to the provider's tests,
>>> right?
>>>
>>> Absolutely.
>>>
>>> On Fri, May 10, 2024 at 11:21 AM Andrey Anshin 
>>> wrote:
>>>
>>>> > "enable" tests for 2.8 and 2.7 separately
>>>>
>>>> Just for clarification, this is only related to the provider's tests,
>>>> right?
>>>>
>>>>
>>>>
>>>> On Fri, 10 May 2024 at 13:15, Jarek Potiuk  wrote:
>>>>
>>>> > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>>>> move
>>>> > forward to the initial version of Airflow 3 providers might already
>>>> drop
>>>> > support of these versions in providers.
>>>> > Airflow 2.7 in the mid of August 2024
>>>> > Airflow 2.8 in the mid of December 2024
>>>> >
>>>> > Yep. But also "here and now" those compatibility tests might help us
>>>> to
>>>> > find some hidden incompatibilities (and prevent adding future ones).
>>>> We can
>>>> > see how much complexity we are dealing with when we attempt to enable
>>>> the
>>>> > tests for 2.8 and then 2.7 and decide if it's worth it. The change I
>>>> added
>>>> > makes it easy to just "enable" tests for 2.8 and 2.7 separately.
>>>> >
>>>> > On Fri, May 10, 2024 at 11:10 AM Andrey Anshin <
>>>> andrey.ans...@taragol.is>
>>>> > wrote:
>>>> >
>>>> > > BTW, forget to mention that we should also check Pytest: Good
>>>> Integration
>>>> > > Practices from
>>>> > > https://docs.pytest.org/en/stable/explanation/goodpractices.html
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Fri, 10 May 2024 at 13:07, Andrey Anshin <
>>>> andrey.ans...@taragol.is>
>>>> > > wrote:
>>>> > >
>>>> > > > I think the current solution with run tests against installed
>>>> packages
>>>> > > > might help with future modifications and develop new dev
>>>> experience.
>>>> > And
>>>> > > > what is more important is help to find problems and
>>>> incompatibilities
>>>> > of
>>>> > > > providers with the previous version of Airflow "here and now".
>>>> > > >
>>>> > > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>>>> move
>>>> > > > forward to 

Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-05-13 Thread Jarek Potiuk
Super-excited about that.

Question/Proposal: Could we make it possible to have two (or maybe three -
like a sub-committee) co-owners of a topic? I think it's a lot to put on one
person's head to "own" a topic, and given people's circumstances, volunteer
time, and interruptions (life intervening), it might be a bit risky to put
it on one person's shoulders only.

I know it's against the rule ("if it is owned by many, it's not owned by
anyone") - but I think in our case there are at least some topics that
could benefit from having more than one owner, especially when we know and
trust that we can work together on topics we are passionate about. It might
also encourage people to get out of their comfort zones.

For example - I'd absolutely love to volunteer to co-own "streamline the
development" with Andrey, if he is willing, of course :D (sorry Andrey for
"volunteering you" on that one :D) - and maybe we could get someone else to
join us.

That might have the added benefit of letting us break with the way we've
been doing things. If I owned it alone, I'd likely gravitate towards past
choices, but with others joining me and taking decisions (and
responsibility for making sure we implement them) together, we could make
better decisions and reduce the bus factor for dev tooling / CI in the
future.

BTW, shameless promotion: tomorrow I am giving a talk about that very topic
(in the context of the last few years, not yet Airflow 3.0) at the NY meetup
hosted at Astronomer's NY headquarters:
https://www.meetup.com/nyc-apache-airflow-meetup/events/300017228/ - so if
you are in or around NY, I think you can still sign up :D. I am also going
to PyCon US in Pittsburgh next week, so don't expect too much from me. I
will be gearing up for streamlining the development by talking to the right
people and listening to the latest ideas and best practices of the larger
Python community :).

J.

On Tue, May 14, 2024 at 12:03 AM Kaxil Naik  wrote:

> Thank you all, I am very happy about the discussions.
>
> The mailing list moves fast :). The main reason I recommended starting the
> dev calls in early June was to have some of these discussions on the
> mailing list.
>
> Since Michal already scheduled a call, let's start there to discuss
> various ideas. For the week after that, I have created Airflow 2-style
> recurring open dev calls that anyone can join; info below:
>
> *Date & Time:* Recurring every 2 weeks on Thursday at *4 PM BST* (3 PM
> GMT/UTC | 11 AM EST | 8 AM PST), starting *May 30, 2024, 4:00 PM BST*
> *One-time registration Link*:
>
> https://astronomer.zoom.us/meeting/register/tZAsde2vqDwpE9XrBAbCeIFHA_l7OLywrWkG
> *Add to your calendar*:
>
> https://astronomer.zoom.us/meeting/tZAsde2vqDwpE9XrBAbCeIFHA_l7OLywrWkG/calendar/google/add
>
> I will post the meeting notes on the dev mailing list as well as Confluence
> for archival purposes (example
> ).
>
> Once we discuss the various proposals next week, I recommend that for each
> "workstream" we have an owner who wants to lead that workstream. For
> items that do not have an owner, we can put those into the Airflow 3 Meta
> issue  or cross-link them over
> there so someone in the community can take them on. If we don't have an
> owner who will commit to working on an item, we park that item until we
> find the owner.
>
> At the end of each call, I would solicit ideas for the agenda for the next
> call and propose it to the broader group on the mailing list.
>
> Some of the items that should be discussed in the upcoming calls IMO:
>
>    - Agreeing on Principles. Based on the discussions, some potential
>      items (all up for debate):
>      - Considering Airflow 3.0 for early adopters and *breaking (and
>        removing) things for AF 3.0*. Things can be re-added as needed in
>        upcoming minor releases
>      - Optimizing to get *foundational pieces in* and not "letting perfect
>        be the enemy of good"
>      - Working on features that solidify Airflow as the *modern
>        Orchestrator* that also has state-of-the-art *support for Data, AI &
>        ML workloads*. This includes the scalability & performance
>        discussion
>      - Setting up the codebase for the next 5 years. This encompasses all
>        the things we are discussing, e.g. removing MySQL to reduce the test
>        matrix, simplifying things architecturally, consolidating
>        serialization methods, etc.
>    - Workstreams & stream owners
>    - Airflow 2 support policy, including scope (features vs. bug fixes +
>      security only) & support period
>    - Separate discussions for each big workstream, including one for items
>      to remove & refactor (e.g. dropping MySQL)
>    - A discussion to streamline the development of Airflow 3:
>      - Separating dev for Providers & Airflow (something Jarek already
>        kick-started), and
>      - A separate branch for Airflow 2
>      - CI changes for 

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-13 Thread Jarek Potiuk
OK. Tests should be green now - all the issues are "handled" - there are a
few follow-up tasks from the test run on 2.9.1, but the PR should be quite
ready for final review and merge now, and I can attempt to look at 2.8
compatibility once it's done.

On Mon, May 13, 2024 at 1:17 AM Jarek Potiuk  wrote:

> OK. I think I found and fixed all the compatibility issues in
> https://github.com/apache/airflow/pull/39513 - except one last
> openlineage plugin enablement fix (but I think reviews for all the other
> changes would be great until we fix the issue). There are probably a few
> incompatibilities that will need to be addressed before we release 2.1.10
> so I need confirmation / comments from Niko/Daniel if my findings are
> correct.
>
> On Fri, May 10, 2024 at 12:54 PM Jarek Potiuk  wrote:
>
>> > Just for clarification, this is only related to the provider's tests,
>> right?
>>
>> Absolutely.
>>
>> On Fri, May 10, 2024 at 11:21 AM Andrey Anshin 
>> wrote:
>>
>>> > "enable" tests for 2.8 and 2.7 separately
>>>
>>> Just for clarification, this is only related to the provider's tests,
>>> right?
>>>
>>>
>>>
>>> On Fri, 10 May 2024 at 13:15, Jarek Potiuk  wrote:
>>>
>>> > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>>> move
>>> > forward to the initial version of Airflow 3 providers might already
>>> drop
>>> > support of these versions in providers.
>>> > Airflow 2.7 in the mid of August 2024
>>> > Airflow 2.8 in the mid of December 2024
>>> >
>>> > Yep. But also "here and now" those compatibility tests might help us to
>>> > find some hidden incompatibilities (and prevent adding future ones).
>>> We can
>>> > see how much complexity we are dealing with when we attempt to enable
>>> the
>>> > tests for 2.8 and then 2.7 and decide if it's worth it. The change I
>>> added
>>> > makes it easy to just "enable" tests for 2.8 and 2.7 separately.
>>> >
>>> > On Fri, May 10, 2024 at 11:10 AM Andrey Anshin <
>>> andrey.ans...@taragol.is>
>>> > wrote:
>>> >
>>> > > BTW, forget to mention that we should also check Pytest: Good
>>> Integration
>>> > > Practices from
>>> > > https://docs.pytest.org/en/stable/explanation/goodpractices.html
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Fri, 10 May 2024 at 13:07, Andrey Anshin <
>>> andrey.ans...@taragol.is>
>>> > > wrote:
>>> > >
>>> > > > I think the current solution with run tests against installed
>>> packages
>>> > > > might help with future modifications and develop new dev
>>> experience.
>>> > And
>>> > > > what is more important is help to find problems and
>>> incompatibilities
>>> > of
>>> > > > providers with the previous version of Airflow "here and now".
>>> > > >
>>> > > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>>> move
>>> > > > forward to the initial version of Airflow 3 providers might already
>>> > drop
>>> > > > support of these versions in providers.
>>> > > > Airflow 2.7 in the mid of August 2024
>>> > > > Airflow 2.8 in the mid of December 2024
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Fri, 10 May 2024 at 12:32, Jarek Potiuk 
>>> wrote:
>>> > > >
>>> > > >> And yes - as we get down to 2.8 and 2.7 it might be possible that
>>> we
>>> > > will
>>> > > >> already implement some of the simplifications you mentioned as it
>>> > might
>>> > > be
>>> > > >> easier than adding back-compatiblity to the current ways. I
>>> assume it
>>> > > will
>>> > > >> be `quite` a bit harder to make our test suite work with Airflow
>>> 2.8
>>> > and
>>> > > >> then 2.7 - so it might be that some of the refactors and changes
>>> will
>>> > > need
>>> > > >> to be applied to make it easier to maintain.
>>> > > >>
>>> > > >> On Fri, May 10, 2024 at 10

Re: [VOTE] Airflow Providers prepared on May 12, 2024

2024-05-13 Thread Jarek Potiuk
+1 (binding): verified reproducibility, signatures, checksums, licences.

On Sun, May 12, 2024 at 9:34 PM Elad Kalif  wrote:

> Hey all,
>
> I have just cut the new wave of Airflow Providers packages. This email is
> calling a vote on the release,
> which will last for 72 hours - which means that it will end on May 15, 2024
> at 19:30 UTC and once at least 3 binding +1 votes have been received.
>
> Consider this my (binding) +1.
>
> Airflow Providers are available at:
> https://dist.apache.org/repos/dist/dev/airflow/providers/
>
> *apache-airflow-providers--*.tar.gz* are the binary
>  Python "sdist" release - they are also official "sources" for the provider
> packages.
>
> *apache_airflow_providers_-*.whl are the binary
>  Python "wheel" release.
>
> The test procedure for PMC members is described in
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
>
> The test procedure for Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
>
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> Please note that the version number excludes the 'rcX' string.
> This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> The status of testing the providers by the community is kept here:
> https://github.com/apache/airflow/issues/39578
>
> The issue is also the easiest way to see important PRs included in the RC
> candidates.
> Detailed changelog for the providers will be published in the documentation
> after the
> RC candidates are released.
>
> You can find the RC packages in PyPI following these links:
>
> https://pypi.org/project/apache-airflow-providers-amazon/8.22.0rc1/
> https://pypi.org/project/apache-airflow-providers-apache-iceberg/1.0.0rc1/
> https://pypi.org/project/apache-airflow-providers-google/10.18.0rc2/
>
> https://pypi.org/project/apache-airflow-providers-microsoft-azure/10.1.0rc2/
> https://pypi.org/project/apache-airflow-providers-pinecone/2.0.0rc2/
> https://pypi.org/project/apache-airflow-providers-tabular/1.5.1rc1/
>
> Cheers,
> Elad Kalif
>


Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-12 Thread Jarek Potiuk
OK. I think I found and fixed all the compatibility issues in
https://github.com/apache/airflow/pull/39513 - except one last openlineage
plugin enablement fix (but I think reviews of all the other changes would be
great while we fix that issue). There are probably a few incompatibilities
that will need to be addressed before we release 2.1.10, so I need
confirmation / comments from Niko/Daniel on whether my findings are correct.

On Fri, May 10, 2024 at 12:54 PM Jarek Potiuk  wrote:

> > Just for clarification, this is only related to the provider's tests,
> right?
>
> Absolutely.
>
> On Fri, May 10, 2024 at 11:21 AM Andrey Anshin 
> wrote:
>
>> > "enable" tests for 2.8 and 2.7 separately
>>
>> Just for clarification, this is only related to the provider's tests,
>> right?
>>
>>
>>
>> On Fri, 10 May 2024 at 13:15, Jarek Potiuk  wrote:
>>
>> > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to move
>> > forward to the initial version of Airflow 3 providers might already drop
>> > support of these versions in providers.
>> > Airflow 2.7 in the mid of August 2024
>> > Airflow 2.8 in the mid of December 2024
>> >
>> > Yep. But also "here and now" those compatibility tests might help us to
>> > find some hidden incompatibilities (and prevent adding future ones). We
>> can
>> > see how much complexity we are dealing with when we attempt to enable
>> the
>> > tests for 2.8 and then 2.7 and decide if it's worth it. The change I
>> added
>> > makes it easy to just "enable" tests for 2.8 and 2.7 separately.
>> >
>> > On Fri, May 10, 2024 at 11:10 AM Andrey Anshin <
>> andrey.ans...@taragol.is>
>> > wrote:
>> >
>> > > BTW, forget to mention that we should also check Pytest: Good
>> Integration
>> > > Practices from
>> > > https://docs.pytest.org/en/stable/explanation/goodpractices.html
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Fri, 10 May 2024 at 13:07, Andrey Anshin > >
>> > > wrote:
>> > >
>> > > > I think the current solution with run tests against installed
>> packages
>> > > > might help with future modifications and develop new dev experience.
>> > And
>> > > > what is more important is help to find problems and
>> incompatibilities
>> > of
>> > > > providers with the previous version of Airflow "here and now".
>> > > >
>> > > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
>> move
>> > > > forward to the initial version of Airflow 3 providers might already
>> > drop
>> > > > support of these versions in providers.
>> > > > Airflow 2.7 in the mid of August 2024
>> > > > Airflow 2.8 in the mid of December 2024
>> > > >
>> > > >
>> > > >
>> > > > On Fri, 10 May 2024 at 12:32, Jarek Potiuk 
>> wrote:
>> > > >
>> > > >> And yes - as we get down to 2.8 and 2.7 it might be possible that
>> we
>> > > will
>> > > >> already implement some of the simplifications you mentioned as it
>> > might
>> > > be
>> > > >> easier than adding back-compatiblity to the current ways. I assume
>> it
>> > > will
>> > > >> be `quite` a bit harder to make our test suite work with Airflow
>> 2.8
>> > and
>> > > >> then 2.7 - so it might be that some of the refactors and changes
>> will
>> > > need
>> > > >> to be applied to make it easier to maintain.
>> > > >>
>> > > >> On Fri, May 10, 2024 at 10:27 AM Jarek Potiuk 
>> > wrote:
>> > > >>
>> > > >> > Yep. I think these are all good ideas, and I think this should be
>> > part
>> > > >> of
>> > > >> > our big Airflow 2 vs. Airflow 3 discussion. Almost as important
>> as
>> > > what
>> > > >> is
>> > > >> > in and what is out is where and how development of different
>> > > components
>> > > >> > happen. Same repo? Different repos? Different branches? Single
>> > > monorepo
>> > > >> for
>> > > >> > Airflow2 + Providers, and separate repo for Airflow 3 only?
>> Keeping

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-10 Thread Jarek Potiuk
> Just for clarification, this is only related to the provider's tests,
right?

Absolutely.

On Fri, May 10, 2024 at 11:21 AM Andrey Anshin 
wrote:

> > "enable" tests for 2.8 and 2.7 separately
>
> Just for clarification, this is only related to the provider's tests,
> right?
>
>
>
> On Fri, 10 May 2024 at 13:15, Jarek Potiuk  wrote:
>
> > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to move
> > forward to the initial version of Airflow 3 providers might already drop
> > support of these versions in providers.
> > Airflow 2.7 in the mid of August 2024
> > Airflow 2.8 in the mid of December 2024
> >
> > Yep. But also "here and now" those compatibility tests might help us to
> > find some hidden incompatibilities (and prevent adding future ones). We
> can
> > see how much complexity we are dealing with when we attempt to enable the
> > tests for 2.8 and then 2.7 and decide if it's worth it. The change I
> added
> > makes it easy to just "enable" tests for 2.8 and 2.7 separately.
> >
> > On Fri, May 10, 2024 at 11:10 AM Andrey Anshin  >
> > wrote:
> >
> > > BTW, forget to mention that we should also check Pytest: Good
> Integration
> > > Practices from
> > > https://docs.pytest.org/en/stable/explanation/goodpractices.html
> > >
> > >
> > >
> > >
> > >
> > > On Fri, 10 May 2024 at 13:07, Andrey Anshin 
> > > wrote:
> > >
> > > > I think the current solution with run tests against installed
> packages
> > > > might help with future modifications and develop new dev experience.
> > And
> > > > what is more important is help to find problems and incompatibilities
> > of
> > > > providers with the previous version of Airflow "here and now".
> > > >
> > > > Regarding Airflow 2.7 and Airflow 2.8 in the time we are ready to
> move
> > > > forward to the initial version of Airflow 3 providers might already
> > drop
> > > > support of these versions in providers.
> > > > Airflow 2.7 in the mid of August 2024
> > > > Airflow 2.8 in the mid of December 2024
> > > >
> > > >
> > > >
> > > > On Fri, 10 May 2024 at 12:32, Jarek Potiuk  wrote:
> > > >
> > > >> And yes - as we get down to 2.8 and 2.7 it might be possible that we
> > > will
> > > >> already implement some of the simplifications you mentioned as it
> > might
> > > be
> > > >> easier than adding back-compatiblity to the current ways. I assume
> it
> > > will
> > > >> be `quite` a bit harder to make our test suite work with Airflow 2.8
> > and
> > > >> then 2.7 - so it might be that some of the refactors and changes
> will
> > > need
> > > >> to be applied to make it easier to maintain.
> > > >>
> > > >> On Fri, May 10, 2024 at 10:27 AM Jarek Potiuk 
> > wrote:
> > > >>
> > > >> > Yep. I think these are all good ideas, and I think this should be
> > part
> > > >> of
> > > >> > our big Airflow 2 vs. Airflow 3 discussion. Almost as important as
> > > what
> > > >> is
> > > >> > in and what is out is where and how development of different
> > > components
> > > >> > happen. Same repo? Different repos? Different branches? Single
> > > monorepo
> > > >> for
> > > >> > Airflow2 + Providers, and separate repo for Airflow 3 only?
> Keeping
> > > >> > monorepo for Airflow 3 ? How do we cherry-pick?
> > > >> >  I think we need to "design" the developer experience as part of
> our
> > > >> > discussion - and it should be a serious discussion considering all
> > the
> > > >> > consequences. How do we test things together? How do we test
> > > >> > back-compatibility? How do we prevent Airflow 3 PRs breaking
> > > providers?
> > > >> > Should we separate-out Helm chart as well? There are many many
> > > questions
> > > >> > and multiple possible answers.
> > > >> >
> > > >> > But let's not derail this discussion - my proposal is to use what
> we
> > > >> have
> > > >> > now and simply get back-compatibility working without changing the
> > > >> > structure (yet

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-10 Thread Jarek Potiuk
> Regarding Airflow 2.7 and Airflow 2.8: by the time we are ready to move
forward to the initial version of Airflow 3, providers might already have
dropped support for these versions.
Airflow 2.7 in mid-August 2024
Airflow 2.8 in mid-December 2024

Yep. But also "here and now" those compatibility tests might help us to
find some hidden incompatibilities (and prevent adding future ones). We can
see how much complexity we are dealing with when we attempt to enable the
tests for 2.8 and then 2.7 and decide if it's worth it. The change I added
makes it easy to just "enable" tests for 2.8 and 2.7 separately.
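For illustration, the per-version "enable" switch described above can be sketched with nothing but the standard library. The real mechanism lives in breeze/CI and is more involved; the names and the version list below are made up for this example:

```python
# Minimal, stdlib-only sketch of "enabling" the provider compatibility test
# matrix per Airflow version (illustrative names, not the actual CI code).

def parse_version(v: str) -> tuple:
    """'2.9.1' -> (2, 9, 1), so versions compare as plain tuples."""
    return tuple(int(part) for part in v.split("."))

# Versions the provider test suite is currently "enabled" for:
ENABLED_AIRFLOW_VERSIONS = ["2.9.1"]  # later, separately: + ["2.8.4", "2.7.3"]

def is_enabled(installed: str) -> bool:
    """True if compatibility tests should run against this installed version."""
    enabled = {parse_version(v) for v in ENABLED_AIRFLOW_VERSIONS}
    return parse_version(installed) in enabled
```

Adding 2.8 or 2.7 to the matrix is then a one-line change to the enabled list, which is the "enable tests separately" property the message describes.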

On Fri, May 10, 2024 at 11:10 AM Andrey Anshin 
wrote:

> BTW, forgot to mention that we should also check Pytest: Good Integration
> Practices from
> https://docs.pytest.org/en/stable/explanation/goodpractices.html
>
>
>
>
>
> On Fri, 10 May 2024 at 13:07, Andrey Anshin 
> wrote:
>
> > I think the current solution of running tests against installed packages
> > might help with future modifications and with developing a new dev
> > experience. What is more important, it helps to find problems and
> > incompatibilities of providers with previous versions of Airflow "here
> > and now".
> >
> > Regarding Airflow 2.7 and Airflow 2.8: by the time we are ready to move
> > forward to the initial version of Airflow 3, providers might already have
> > dropped support for these versions.
> > Airflow 2.7 in mid-August 2024
> > Airflow 2.8 in mid-December 2024
> >
> >
> >
> > On Fri, 10 May 2024 at 12:32, Jarek Potiuk  wrote:
> >
> >> And yes - as we get down to 2.8 and 2.7 it might be possible that we
> will
> >> already implement some of the simplifications you mentioned as it might
> be
> >> easier than adding back-compatibility to the current ways. I assume it
> will
> >> be `quite` a bit harder to make our test suite work with Airflow 2.8 and
> >> then 2.7 - so it might be that some of the refactors and changes will
> need
> >> to be applied to make it easier to maintain.
> >>
> >> On Fri, May 10, 2024 at 10:27 AM Jarek Potiuk  wrote:
> >>
> >> > Yep. I think these are all good ideas, and I think this should be part
> >> of
> >> > our big Airflow 2 vs. Airflow 3 discussion. Almost as important as
> what
> >> is
> >> > in and what is out is where and how development of different
> components
> >> > happen. Same repo? Different repos? Different branches? Single
> monorepo
> >> for
> >> > Airflow2 + Providers, and separate repo for Airflow 3 only? Keeping
> >> > monorepo for Airflow 3 ? How do we cherry-pick?
> >> >  I think we need to "design" the developer experience as part of our
> >> > discussion - and it should be a serious discussion considering all the
> >> > consequences. How do we test things together? How do we test
> >> > back-compatibility? How do we prevent Airflow 3 PRs breaking
> providers?
> >> > Should we separate-out Helm chart as well? There are many many
> questions
> >> > and multiple possible answers.
> >> >
> >> > But let's not derail this discussion - my proposal is to use what we
> >> have
> >> > now and simply get back-compatibility working without changing the
> >> > structure (yet), but as part of Airflow 2 vs. Airflow 3 we should make
> >> sure
> >> > this topic is fully covered and we get to consensus on the answers.
> >> >
> >> > J
> >> >
> >> >
> >> > On Fri, May 10, 2024 at 9:17 AM Andrey Anshin <
> andrey.ans...@taragol.is
> >> >
> >> > wrote:
> >> >
> >> >> Great job, Jarek!
> >> >>
> >> >> I would have some proposals, which should be considered as a long
> term
> >> >>
> >> >>
> >> >> We should rework our test structure to fully run provider tests
> without
> >> >> touching the Core tests.
> >> >> The main problem here is that we configure a lot of things into the
> >> root
> >> >> conftest.py which might be a problem in case of running tests on a
> >> >> provider
> >> >> under a different version of the airflow. Core itself might use
> >> something
> >> >> which was only added in a recent version of Airflow, but this should
> >> not
> >> >> be
> >> >> a case in case of providers. So we should slig

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-10 Thread Jarek Potiuk
And yes - as we get down to 2.8 and 2.7 it might be possible that we will
already implement some of the simplifications you mentioned as it might be
easier than adding back-compatibility to the current ways. I assume it will
be `quite` a bit harder to make our test suite work with Airflow 2.8 and
then 2.7 - so it might be that some of the refactors and changes will need
to be applied to make it easier to maintain.

On Fri, May 10, 2024 at 10:27 AM Jarek Potiuk  wrote:

> Yep. I think these are all good ideas, and I think this should be part of
> our big Airflow 2 vs. Airflow 3 discussion. Almost as important as what is
> in and what is out is where and how development of different components
> happen. Same repo? Different repos? Different branches? Single monorepo for
> Airflow2 + Providers, and separate repo for Airflow 3 only? Keeping
> monorepo for Airflow 3 ? How do we cherry-pick?
>  I think we need to "design" the developer experience as part of our
> discussion - and it should be a serious discussion considering all the
> consequences. How do we test things together? How do we test
> back-compatibility? How do we prevent Airflow 3 PRs breaking providers?
> Should we separate-out Helm chart as well? There are many many questions
> and multiple possible answers.
>
> But let's not derail this discussion - my proposal is to use what we have
> now and simply get back-compatibility working without changing the
> structure (yet), but as part of Airflow 2 vs. Airflow 3 we should make sure
> this topic is fully covered and we get to consensus on the answers.
>
> J
>
>
> On Fri, May 10, 2024 at 9:17 AM Andrey Anshin 
> wrote:
>
>> Great job, Jarek!
>>
>> I have some proposals which should be considered as long-term ones.
>>
>>
>> We should rework our test structure to fully run provider tests without
>> touching the Core tests.
>> The main problem here is that we configure a lot of things in the root
>> conftest.py, which might be a problem when running tests of a provider
>> under a different version of Airflow. Core itself might use something
>> which was only added in a recent version of Airflow, but this should not
>> be the case for providers. So we should slightly change the test
>> structure, unless we can decouple providers from the mono repo (I'm not
>> sure that is even planned for the future). E.g. move tests/providers to
>> tests/providers/unit, so that we would have
>> tests/providers/{unit|system|integration|conftest.py}; maybe some helpers
>> for providers should also be moved into tests/providers/helpers (I don't
>> like the name "helpers", but it is only for reference). At the same
>> time, move core-related tests to tests/core (the name could be
>> different) and create a structure like
>> tests/core/{unit|system|integration|helpers|conftest.py}. And move as
>> much as possible from tests/conftest.py to the appropriate
>> tests/{core|providers}/conftest.py
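As a rough illustration of the split proposed above, a slimmed-down root conftest.py could keep only truly global setup and map each sub-tree to its own suite-level conftest. Everything below (paths, the env var, function names) is a hypothetical sketch of the idea, not the current Airflow layout:

```python
# Hypothetical sketch: root tests/conftest.py keeps only global setup and
# delegates heavyweight fixtures to suite-level conftest files.
import os

# Only configuration that genuinely applies to every suite stays at the root.
os.environ.setdefault("AIRFLOW_HOME", "/tmp/airflow-tests")  # illustrative

# Which sub-tree's conftest.py owns the suite-specific setup:
SUITES = {
    "core": "tests/core/conftest.py",
    "providers": "tests/providers/conftest.py",
}

def conftest_for(test_path: str) -> str:
    """Pick the suite-level conftest responsible for a given test file."""
    for suite, conftest in SUITES.items():
        if test_path.startswith(f"tests/{suite}/"):
            return conftest
    return "tests/conftest.py"  # fallback: root-level shared setup only
```

The point of the structure is that provider tests never import core-only fixtures, so they can run against any installed Airflow version that the provider itself supports.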
>>
>>
>> Provider tests should not rely on a DB backend, and should be easy to
>> run on any of the supported ones, because providers do not extend DB
>> backend support; if tests pass in core, we take the assumption that
>> providers can use any of them, e.g. SQLite (preferable for setup with
>> xdist) or Postgres.
>>
>> If we go even further, we might want to move specific helpers into a
>> separate test package, e.g. `pytest-apache-airflow`, and move all common
>> helpers and the simple setup/configuration of the test Airflow
>> environment there (a really simple one as a first step), with the same
>> compatibility level as providers: 1 year after a feature version is
>> released. We could test this package against different versions of
>> Airflow to make sure that with the combination Airflow (2.7-2.9 + main)
>> + `pytest-apache-airflow` we can run tests against each provider.
>> This pytest package would also be released and uploaded to PyPI and
>> could be installed via pip/uv; however, at least for the initial stage,
>> it shouldn't be considered for use outside of Airflow and Airflow
>> Providers CI. In other words, it is not GA for end users. This might
>> change in the future, but let's keep this package focused on Airflow
>> development internals only.
>>
>> On Fri, 10 May 2024 at 01:08, Jarek Potiuk  wrote:
>>
>> > Hello everyone,
>> >
>> > As part of preparation for the Airflow 3 move and (possible) provider
>> > separation (I have some ideas how to do it but that should be a separate
>> > discussion) I took on the task of improving our compatibility tests for
>> > Providers. I 

Re: [PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-10 Thread Jarek Potiuk
Yep. I think these are all good ideas, and I think this should be part of
our big Airflow 2 vs. Airflow 3 discussion. Almost as important as what is
in and what is out is where and how development of different components
happen. Same repo? Different repos? Different branches? Single monorepo for
Airflow2 + Providers, and separate repo for Airflow 3 only? Keeping
monorepo for Airflow 3 ? How do we cherry-pick?
 I think we need to "design" the developer experience as part of our
discussion - and it should be a serious discussion considering all the
consequences. How do we test things together? How do we test
back-compatibility? How do we prevent Airflow 3 PRs breaking providers?
Should we separate-out Helm chart as well? There are many many questions
and multiple possible answers.

But let's not derail this discussion - my proposal is to use what we have
now and simply get back-compatibility working without changing the
structure (yet), but as part of Airflow 2 vs. Airflow 3 we should make sure
this topic is fully covered and we get to consensus on the answers.

J


On Fri, May 10, 2024 at 9:17 AM Andrey Anshin 
wrote:

> Great job, Jarek!
>
> I have some proposals which should be considered as long-term ones.
>
>
> We should rework our test structure to fully run provider tests without
> touching the Core tests.
> The main problem here is that we configure a lot of things in the root
> conftest.py, which might be a problem when running tests of a provider
> under a different version of Airflow. Core itself might use something
> which was only added in a recent version of Airflow, but this should not
> be the case for providers. So we should slightly change the test
> structure, unless we can decouple providers from the mono repo (I'm not
> sure that is even planned for the future). E.g. move tests/providers to
> tests/providers/unit, so that we would have
> tests/providers/{unit|system|integration|conftest.py}; maybe some helpers
> for providers should also be moved into tests/providers/helpers (I don't
> like the name "helpers", but it is only for reference). At the same time,
> move core-related tests to tests/core (the name could be different) and
> create a structure like
> tests/core/{unit|system|integration|helpers|conftest.py}. And move as
> much as possible from tests/conftest.py to the appropriate
> tests/{core|providers}/conftest.py
>
>
> Provider tests should not rely on a DB backend, and should be easy to run
> on any of the supported ones, because providers do not extend DB backend
> support; if tests pass in core, we take the assumption that providers can
> use any of them, e.g. SQLite (preferable for setup with xdist) or Postgres.
>
> If we go even further, we might want to move specific helpers into a
> separate test package, e.g. `pytest-apache-airflow`, and move all common
> helpers and the simple setup/configuration of the test Airflow environment
> there (a really simple one as a first step), with the same compatibility
> level as providers: 1 year after a feature version is released. We could
> test this package against different versions of Airflow to make sure that
> with the combination Airflow (2.7-2.9 + main) + `pytest-apache-airflow` we
> can run tests against each provider.
> This pytest package would also be released and uploaded to PyPI and could
> be installed via pip/uv; however, at least for the initial stage, it
> shouldn't be considered for use outside of Airflow and Airflow Providers
> CI. In other words, it is not GA for end users. This might change in the
> future, but let's keep this package focused on Airflow development
> internals only.
>
> On Fri, 10 May 2024 at 01:08, Jarek Potiuk  wrote:
>
> > Hello everyone,
> >
> > As part of preparation for the Airflow 3 move and (possible) provider
> > separation (I have some ideas how to do it but that should be a separate
> > discussion) I took on the task of improving our compatibility tests for
> > Providers. I discussed it briefly with Kaxil and Ash and decided to give
> it
> > a go and see what it takes.
> >
> > The PR here: https://github.com/apache/airflow/pull/39513
> >
> > I extended our "import" checks with checks that also run all provider
> unit
> > tests for specified airflow versions (for now 2.9.1 - but once we get it
> > merged/approved we can make sure the tests are working for 2.7 and 2.8).
> We
> > will also be able to run "future" compatibility tests in case we decide
> to
> > leave providers aside from Airflow 3 and will be able to run the tests
> for
> > both`main` and `pypi`-released versions of airflow.
> >
> > A number of our tests rely on some internals of Airflow  and they
> > implicitly rely on the fact that th

[PROPOSAL] Improved compatibility checks for Providers (running unit tests for multiple airflow versions)

2024-05-09 Thread Jarek Potiuk
Hello everyone,

As part of preparation for the Airflow 3 move and (possible) provider
separation (I have some ideas how to do it but that should be a separate
discussion) I took on the task of improving our compatibility tests for
Providers. I discussed it briefly with Kaxil and Ash and decided to give it
a go and see what it takes.

The PR here: https://github.com/apache/airflow/pull/39513

I extended our "import" checks with checks that also run all provider unit
tests for specified airflow versions (for now 2.9.1 - but once we get it
merged/approved we can make sure the tests are working for 2.7 and 2.8). We
will also be able to run "future" compatibility tests in case we decide to
leave providers aside from Airflow 3 and will be able to run the tests for
both`main` and `pypi`-released versions of airflow.

A number of our tests rely on some internals of Airflow, and they
implicitly rely on the fact that they are run directly in airflow source
tree - but there are not many of those - after some initial compatibility
fixes I got 50 or so tests failing for 2.9.1 (probably there will be more
for 2.8.0 and 2.7.0, but I want to make 2.9.1 work first).

I almost got it working (a few tests are still failing) with compatibility
for 2.9.1 but I will need some help from a few people - around
openlineage and serialization but also around recently improved try_number
:). I will reach out to the relevant people individually if we see that as
a good idea.

It requires some care when writing tests to make sure the tests can be run
against installed airflow and not from sources. So in the future anyone
contributing provider changes will have to make sure the tests pass also
for past airflow versions (there are simple instructions explaining how to
do it with breeze). But once we merge it, this will be caught on PR level
and should be easy to fix any of those problems.

The benefit of having the tests is that we not only do simple import tests
but actually run provider tests. The drawback is that sometimes tests will
have to be adapted to make sure they also work with older installed Airflow
versions - which is not always straightforward or easy and will need some
compatibility code in tests. For example, after the recent rename of
airflow.models.ImportError to ParsingImportError, we had to add compat.py
to test_utils and import ParsingImportError from there rather than from
Airflow directly in tests.
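The compat.py approach mentioned above follows a common shim pattern: tests import a symbol from one place, and the shim resolves whichever name the installed version provides. Here is a generic, stdlib-only sketch of that pattern; the real Airflow module paths and class names may differ from the commented usage line:

```python
# Generic version-compat shim: try newer names first, fall back to older ones.
import importlib

def import_any(module_name: str, *candidate_names: str):
    """Return the first attribute of `module_name` that exists among the
    candidates, trying newer names first and older ones as fallback."""
    mod = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(mod, name):
            return getattr(mod, name)
    raise ImportError(f"none of {candidate_names} found in {module_name}")

# Hypothetical usage in a test_utils/compat.py (paths are an assumption):
#   ParsingImportError = import_any("airflow.models", "ParsingImportError", "ImportError")
```

Tests then import the symbol from compat.py only, so the same test module runs unchanged against both older and newer installed Airflow versions.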

I don't think it's too controversial - being able to run unit tests for
providers for old (and future) versions of Airflow is generally quite an
improvement in stability, but this adds a bit of overhead to contributions, so
I am letting everyone here know it's coming, so that it's not a surprise to
contributors.

J.


Re: [VOTE] AIP-67 Multi-team deployment of Airflow components

2024-05-09 Thread Jarek Potiuk
Just to clarify the state for that one.

I would like to put that one on-hold until we get clarity on Airflow 2 vs
Airflow 3 approach:
https://lists.apache.org/thread/3chvg9964zvh15mtrbl073f4oj3nlzp2

There is currently a veto from Ash, so this stays on hold until the veto is
withdrawn or we change the problematic "team" database schema modification
approach. I think the choice we make here depends a lot on the Airflow 3
discussions.

We have those options here:

* Treat the Airflow 2 multi-team approach as a "tactical" solution and
implement it in a non-future-compliant way (and use Airflow 3 features to
implement it better for Airflow 3). This one is simplest and has very
limited impact on UI/API/DB etc. (so basically minimal ripple effect).
* Implement multi-team as future-proof in Airflow 2, with proper schema
changes and the ripple effects they might have for the UI, API and all the
other components.
* Only implement multi-team as an Airflow 3 feature (which might be much
easier to do, depending on the scope of Airflow 3 changes we will target).
Some of the changes proposed have a significant overlap with the multi-team
proposal, and we should make sure to discuss it as part of our Airflow 3
planning.

I currently do not know which option is best - as a lot depends on Airflow
3 discussions. So I think putting this on hold and deciding what to do
after we have more clarity is the best approach.

J.




On Wed, Apr 24, 2024 at 9:06 PM Mehta, Shubham 
wrote:

> +1 (non-binding). Looking forward to this one.
>
> Shubham
>
> On 2024-04-22, 12:31 AM, "Amogh Desai" wrote:
>
>
>
> +1 binding.
>
>
> Excited to see this happen!
>
>
> Thanks & Regards,
> Amogh Desai
>
>
>
>
> > On Sat, Apr 20, 2024 at 12:11 AM Igor Kholopov
> wrote:
>
>
> > +1 (non-binding)
> >
> > Great to see this happening, hope we will see more proposals towards
> making
> > Airflow more flexible!
> >
> > Regards,
> > Igor
> >
> > On Fri, Apr 19, 2024 at 8:10 PM Daniel Standish
> > wrote:
> >
> > > >
> > > > It doesn’t affect my vote on the API, but I am very strongly against
> > this
> > > > one part of the AIP:
> > > > > … dag_id are namespaced with `:` prefix.
> > > > This specific part is getting an implementation/code veto from me. We
> > > made
> > > > the mistake of overloading one column to store multiple things in
> > Airflow
> > > > before, and I’ve dealt with the fallout in other apps in the past.
> > Trust
> > > > me: do. not. do. this.
> > >
> > > I agree with Ash's sentiment. Is adding a tenant_id or something so
> > > unpalatable?
> > >
> >
>
>
>
>
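As an aside on the schema point raised above (a `:`-prefixed dag_id versus a dedicated tenant/team column), a toy example shows why a separate column is easier to query and constrain. This is an illustrative sketch only, not Airflow's actual schema:

```python
# Toy schema contrasting a dedicated team column with an overloaded dag_id.
import sqlite3

con = sqlite3.connect(":memory:")
# Separate column: uniqueness and filtering are expressed directly in SQL.
con.execute(
    "CREATE TABLE dag (dag_id TEXT, team_id TEXT, PRIMARY KEY (team_id, dag_id))"
)
con.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [("etl_daily", "data"), ("etl_daily", "ml"), ("reporting", "data")],
)

# Per-team queries are trivial and index-friendly:
rows = con.execute(
    "SELECT dag_id FROM dag WHERE team_id = ? ORDER BY dag_id", ("data",)
).fetchall()
# rows == [("etl_daily",), ("reporting",)]

# With an overloaded dag_id ("data:etl_daily"), every consumer - SQL, Python
# and the UI - would instead have to parse prefixes, e.g.:
team, _, plain_id = "data:etl_daily".partition(":")
```

The overloaded form forces string parsing into every query and every downstream app, which is the "fallout" the veto describes.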


Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-05-08 Thread Jarek Potiuk
e are willing to cut - which could be a painstaking
> process that could offset any gains of trying to be faster. I also believe
> that bringing all these new amazing features on Airflow 3 will peak the
> interest of early adopters and eventually get others interested in
> migration. However, I believe this migration will be a slow process and
> will present a gap in certain functionalities that users may want before
> entertaining any move to Airflow 3. There are still a lot of folks using
> v1.10 today. There were several tactical initiatives in the past few months
> with intent on bringing new functionality, ie Multi-team, to Airflow 2.x
> and I feel that while these efforts are not wasted, there should still be
> an option to continue improving Airflow 2 to avoid alienating our users on
> the basis of a future promise in Airflow 3, that may not be easy to migrate
> towards.
>
> -- Rajesh
>
>
>
>
>
>
>
>
>
>
> On 2024-05-07, 12:26 PM, "Constance Martineau"
> wrote:
>
>
>
>
>
>
> Thank you Jarek for the detailed input. I've taken some time to digest your
> points before responding.
>
>
>
>
> You've outlined a bold vision for Airflow 3, and I agree that being
> decisive about the features and architectures will be the key to success.
> However, before we make final decisions on what features to cut or retain,
> it would be beneficial to have a more comprehensive understanding of how
> the current features are utilized by the open-source community.
>
>
>
>
> @Kaxil Naik recently initiated a discussion on collecting telemetry from
> open-source deployments:
> https://lists.apache.org/thread/7f6qyr8w2n8w34g63s7ybhzphgt8h43m
> This data
> could be critical in ensuring our decisions are well-informed and reflect
> the real-world usage patterns of our users, not just those from managed
> environments like Astro, MWAA or GCC.
>
>
>
>
> It's essential that we challenge our assumptions and base our decisions on
> a holistic view of feature usage. Identifying potential cuts is a critical
> step, but let's ensure our strategy aligns with the needs and preferences
> of the broader Airflow community.
>
>
>
>
> On Mon, May 6, 2024 at 6:50 PM Jarek Potiuk
> wrote:
>
>
>
>
> > I am currently on sick leave, and still recovering - hoping to be able to
> > travel next week to the US as planned, so I just wanted to break out of
> it
> > to make one comment here.
> >
> > I got a clearer head now a bit with medications hopefully working. I am
> > still taking it that should help me to get over the current state, and I
> > wanted to take a look at this discussion unraveling first. Over last week
> > I disconnected from "day-to-day" Airflow and put some thoughts (as much
> as
> > I could in my current state) on it. The whole subject of this thread was
> > started from that - how the current discussions on AIP-67 and others
> change
> > if we consider Airflow 3 is "starting".
> >
> > The price for back-compat is speed of development and quality. More
> > combinations to test, more unexpected issues uncovered, necessity to keep
> > parallel paths (old/new) while adding new features. All what Constance
> > wrote about and what Ash explained. We already started to trip over our
> own
> feet multiple times in a few last releases. Have we tested all
> combinations
> > of deployment in Airflow 2.8 and

Re: [VOTE] Proposal for adding Telemetry via Scarf

2024-05-08 Thread Jarek Potiuk
Short reminder and correction :).

Wei Lee - as a committer, your vote is binding for any votes except
releases. Releases are special - they are a legal act of the
Apache Software Foundation, so when you vote on releases, only PMC votes
are binding.

Generally "releasing software" is what the ASF Foundation does. The
foundation does not create software, only releases it ("for the public
good").
PMC member is an official role in the ASF bylaws,
https://www.apache.org/foundation/bylaws.html (Apache Airflow is a PMC).
This is according to Delaware law - that's where the Foundation is
registered as a non-profit organisation
(https://en.wikipedia.org/wiki/501(c)(3)_organization).

This allows us to release software without the fear that someone will sue
us personally if they are harmed by it, because if - as PMC members - we
follow ASF rules (minimum 3 PMC members, reproducibility check, signatures,
etc.) ASF indemnifies us personally from any harm done to anyone using that
released software.

But all the other decisions in Airflow are voted by committers:
https://github.com/apache/airflow?tab=readme-ov-file#voting-policy - so
committer votes are binding (except releases).

J.

On Thu, May 9, 2024 at 6:46 AM Jarek Potiuk  wrote:

> +1 (binding)
>
> On Thu, May 9, 2024 at 4:47 AM Wei Lee  wrote:
>
>> +1 non-binding
>>
>> Best,
>> Wei
>>
>> > On May 9, 2024, at 10:39 AM, Phani Kumar 
>> > 
>> wrote:
>> >
>> > +1 binding, looking forward to add Scarf
>> >
>> > On Thu, May 9, 2024 at 7:42 AM Kaxil Naik  wrote:
>> >
>> >> Hi all,
>> >>
>> >> Discussion thread:
>> >> https://lists.apache.org/thread/7f6qyr8w2n8w34g63s7ybhzphgt8h43m
>> >>
>> >> I would like to officially call for a vote to add Scarf as a Telemetry
>> >> tool. Some other things:
>> >>
>> >>   - Opt-in by default
>> >>   - Explicit documentation that we collect the telemetry data
>> >>   - Opt-out via airflow.cfg and env var
>> >>   - Works for air-gapped environments
>> >>   - Initial access to only PMC members
>> >>
>> >> I have created a free account on Scarf and added it to the shared
>> 1password
>> >> (only PMC members have access to it). For now, I am just playing around
>> >> with how the information can be shown.
>> >>
>> >> I have a draft PR: https://github.com/apache/airflow/pull/39510 that
>> >> collects some basic info, adds docs & tests.
>> >>
>> >> Looking forward to releasing this for Airflow 2.10.
>> >>
>> >> Consider this my +1 binding vote.
>> >>
>> >> The vote will last until 04:20 GMT/UTC on May 16, 2024, and until at
>> >> least 3 binding votes have been cast.
>> >>
>> >> Please vote accordingly:
>> >>
>> >> [ ] + 1 approve
>> >> [ ] + 0 no opinion
>> >> [ ] - 1 disapprove with the reason
>> >>
>> >> Only votes from PMC members and committers are binding, but other
>> members
>> >> of the community are encouraged to check the AIP and vote with
>> >> "(non-binding)".
>> >>
>> >> Regards,
>> >> Kaxil
>> >>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
>> For additional commands, e-mail: dev-h...@airflow.apache.org
>>
>>
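For reference, an opt-out like the one listed in the proposal above could be honored roughly as follows. The config section, option name and environment variable used here are assumptions for this sketch, not the final implementation:

```python
# Hedged sketch: honoring a telemetry opt-out from airflow.cfg or an env var.
# Section/option/env-var names below are illustrative assumptions.
import os
from configparser import ConfigParser

def usage_data_collection_enabled(cfg_text: str) -> bool:
    """Telemetry is off if either the env var or the config disables it."""
    env = os.environ.get("AIRFLOW__USAGE_DATA_COLLECTION__ENABLED", "")
    if env.lower() == "false":
        return False
    parser = ConfigParser()
    parser.read_string(cfg_text)
    return parser.getboolean("usage_data_collection", "enabled", fallback=True)

airflow_cfg = """
[usage_data_collection]
enabled = False
"""
print(usage_data_collection_enabled(airflow_cfg))  # -> False (config disables it)
```

Defaulting to enabled with two independent ways to turn it off matches the "opt-out via airflow.cfg and env var" bullet in the vote, including air-gapped deployments where the env var is the easiest switch.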


Re: [VOTE] Proposal for adding Telemetry via Scarf

2024-05-08 Thread Jarek Potiuk
+1 (binding)

On Thu, May 9, 2024 at 4:47 AM Wei Lee  wrote:

> +1 non-binding
>
> Best,
> Wei
>
> > On May 9, 2024, at 10:39 AM, Phani Kumar 
> wrote:
> >
> > +1 binding, looking forward to add Scarf
> >
> > On Thu, May 9, 2024 at 7:42 AM Kaxil Naik  wrote:
> >
> >> Hi all,
> >>
> >> Discussion thread:
> >> https://lists.apache.org/thread/7f6qyr8w2n8w34g63s7ybhzphgt8h43m
> >>
> >> I would like to officially call for a vote to add Scarf as a Telemetry
> >> tool. Some other things:
> >>
> >>   - Opt-in by default
> >>   - Explicit documentation that we collect the telemetry data
> >>   - Opt-out via airflow.cfg and env var
> >>   - Works for air-gapped environments
> >>   - Initial access to only PMC members
> >>
> >> I have created a free account on Scarf and added it to the shared
> 1password
> >> (only PMC members have access to it). For now, I am just playing around
> >> with how the information can be shown.
> >>
> >> I have a draft PR: https://github.com/apache/airflow/pull/39510 that
> >> collects some basic info, adds docs & tests.
> >>
> >> Looking forward to releasing this for Airflow 2.10.
> >>
> >> Consider this my +1 binding vote.
> >>
> >> The vote will last until 04:20 GMT/UTC on May 16, 2024, and until at
> >> least 3 binding votes have been cast.
> >>
> >> Please vote accordingly:
> >>
> >> [ ] + 1 approve
> >> [ ] + 0 no opinion
> >> [ ] - 1 disapprove with the reason
> >>
> >> Only votes from PMC members and committers are binding, but other
> members
> >> of the community are encouraged to check the AIP and vote with
> >> "(non-binding)".
> >>
> >> Regards,
> >> Kaxil
> >>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>


Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-05-06 Thread Jarek Potiuk
I am currently on sick leave, and still recovering - hoping to be able to
travel next week to the US as planned, so I just wanted to break out of it
to make one comment here.

I got a clearer head now a bit with medications hopefully working. I am
still taking it that should help me to get over the current state, and I
wanted to take a look at this discussion unraveling first. Over the last week
I disconnected from "day-to-day" Airflow and put some thoughts (as much as
I could in my current state) on it. The whole subject of this thread was
started from that - how the current discussions on AIP-67 and others change
if we consider Airflow 3 is "starting".

The price for back-compat is speed of development and quality. More
combinations to test, more unexpected issues uncovered, necessity to keep
parallel paths (old/new) while adding new features. All what Constance
wrote about and what Ash explained. We already started to trip over our own
feet multiple times in a few last releases. Have we tested all combinations
of deployment in Airflow 2.8 and 2.9 - not really, I think we already see
that in a number of "combos" of features things are not working in as
stable a way as they did before.

Airflow 3 is a bold move. We risk users will stay on Airflow 2 for a long
time (or even move out) as they will not want to move to Airflow 3. A lot
of the work implemented in AIP-44 and design of AIP-67 was done around
back-compatibility. but yes -
it would have been way easier if designed anew without back-compatibility
in mind. And if we implement it and release it in Airflow 2 it will make
new Airflow feature development even harder. That's why I wanted to treat
it as "tactical" solution - hoping that in Airflow 3 we can make it
"properly" - and that's why I started the discussion here when I sensed
that we are "close" to Airflow 3 discussion, because I wanted to see what
options we have there. This is why I have not yet concluded voting on
AIP-67 waiting for the result of this discussion here.

But if we are ready to go for Airflow 3, then I'd say there are two
important things that should be part of the vision.

1)  *We should be far more opinionated and have far fewer options of
running things in Airflow 3*. Even an order of magnitude more opinionated.
Make choices, stick to it, perfect those opinionated choices to suit 80/20
(or even 70/30 or maybe even 60/40) rule if you will. Risking not fitting
the 20% that might choose to stay at Airflow 2. We can choose now which
~20% of cases we do not want to handle deliberately. And we should be very,
very strict about it. Default should be "no choice". This will radically
simplify deployment and should make it easier to simplify Airflow
development and DAG authoring experience because we will have fewer cases to
support. Even if we plan to add more options in the future, the first
version of Airflow 3 should support one deployment approach only. This is
the only way we can deliver it fast. And we should be very bold there.
Choose one option and go for it in pretty much every place we have choices
now. We should aim for Airflow 3.0 to support only a subset of current
users - those who are most likely to migrate first and those with the
biggest need for the new features. We can let 3.x support more cases,
but 3.0 should be as opinionated as humanly possible.

And this deployment option should also be something ALL our stakeholders
will feel OK with as a way forward in their offering.

My candidates (and yes, some are bold):

* *Drop MySQL*. If we have a single thing that makes us avoid our schema
and DB migration - this is the case. Let's choose Postgres 15+ and use some
of the great features there. This will also enable much faster async SQL
implementation and a number of other optimisations - not to mention cutting
every single change in development and testing time by literally half. And
we should not look back to adding MySQL.
* *Drop Celery/Sequential Executor* and start with Local + K8S only (and
AWS/Google and others can of course continue developing theirs in parallel and
continue the Hybrid Executor work). Later we can figure out a better solution
to support "small" tasks using some new K8S features and possibly non-K8S
solutions (Ray-based?)
* *Cut Connection and Variable Management from DB/UI*. Leave only Secrets
Management. Later when we have a 100% extensible React UI, we can add a
"local DB secrets manager" add-on
* *Choose a single way for DAG storage that will support versioning from
day one*. Bear in mind we can add others later. Bolke's idea of using
fsspec is an interesting one; we should see if it is feasible.
* *Drop FAB completely (including custom plugins) and invest in
implementing Auth Manager based on a dedicated, external solution* (KeyCloak
that we've discussed before as a likely candidate)
* *Leave Providers with Airflow 2 and add tests to make sure they are
Airflow 3 future-compatible *- develop a way where we continue development
and contributions 

Re: [VOTE] Release Apache Airflow Python Client 2.9.0 from 2.9.0rc1

2024-04-23 Thread Jarek Potiuk
+1 (binding): tested reproducibility, signatures, checksums, licences (code
is autogenerated), ran some basic tests with Airflow 2.9.0 with the RC1
client

On Tue, Apr 23, 2024 at 1:05 AM Jed Cunningham 
wrote:

> Hey fellow Airflowers,
>
> I have cut the first release candidate for the Apache Airflow Python Client
> 2.9.0.
> This email is calling for a vote on the release,
> which will last for 72 hours. Consider this my (binding) +1.
>
> Airflow Client 2.9.0rc1 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/clients/python/2.9.0rc1/
>
> The apache_airflow_client-2.9.0.tar.gz is an sdist release that contains
> INSTALL instructions, and also is the official source release.
>
> The apache_airflow_client-2.9.0-py3-none-any.whl is a binary wheel release
> that pip can install.
>
> Those packages do not contain .rc* version as, when approved, they will be
> released as the final version.
>
> The rc packages are also available at PyPI (with rc suffix) and you can
> install it with pip as usual:
> https://pypi.org/project/apache-airflow-client/2.9.0rc1/
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The test procedure for PMC members is described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PYTHON_CLIENT.md#verify-the-release-candidate-by-pmc-members
>
> The test procedure for contributors and members of the community who would
> like to test this RC is described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PYTHON_CLIENT.md#verify-the-release-candidate-by-contributors
>
> *Changelog:*
>
> *Major changes:*
>
> - Allow users to write dag_id and task_id in their national characters,
> added display name for dag / task (v2) (#38446)
> - Add dataset_expression to grid dag details (#38121)
> - Adding run_id column to log table (#37731)
> - Show custom instance names for a mapped task in UI (#36797)
> - Add excluded/included events to get_event_logs api (#37641)
> - Filter Datasets by associated dag_ids (GET /datasets) (#37512)
> - Add data_interval_start and data_interval_end in dagrun create API
> endpoint (#36630)
> - Return the specified field when get dag/dagRun (#36641)
>
> *New API supported:*
>
> - Add post endpoint for dataset events (#37570)
> - Add "queuedEvent" endpoint to get/delete DatasetDagRunQueue (#37176)
>
> Thanks,
> Jed
>
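The checksum part of the verification steps above can be sketched in stdlib Python (a minimal illustration; the authoritative procedure is the linked release README):

```python
import hashlib

def sha512_hex(path: str, chunk_size: int = 1 << 16) -> str:
    """Compute the SHA-512 digest of a release artifact, equivalent to
    `shasum -a 512 <file>` in the release-verification instructions."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex digest is compared against the `.sha512` file published next to the artifact on dist.apache.org; signature checks additionally use gpg with the published KEYS file.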


Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-04-22 Thread Jarek Potiuk
I tried to import both emails I saw in the thread
> > into the page as starter. As it is a call to collaborate, please start
> > editing and drop your points as well.
> >
> > Towards Jarek's mentioned trigger points:
> > Actually the dropped AIP-68 and AIP-69 are something that in my view do
> > NOT require Airflow to get to 3.0. I would see them as either "Tactical" or
> > "just functional enhancements". AIP-68 is "just" a bit of sugar to UI and
> > extensions to Plugin interface in my view. AIP-69 is basically building
> > something on-top, based on the concept of Hybrid Executors. As long as we
> > would assume AIP-69 does not need drastic changes, maybe only small
> > adjustments in the core (but concept not elaborated yet). I see this
> mainly
> > as "just another Executor" that should not need breaking changes. I did
> not
> > want to drop these two AIP's to start a fundamental discussion but rather
> > to bring-in a new feature each.
> > The points as factors that are hard to achieve in Airflow 2.x world are
> > rather the "Multi Tenancy/Team" and "Dag Versioning" which in my eyes
> might
> > be able to move faster with a 3.0.
> >
> > P.S.: I do not (yet?) get why GenAI is a trigger point that
> > forces structural breaking changes.
> >
> > Mit freundlichen Grüßen / Best regards
> >
> > Jens Scheffler
> >
> > Alliance: Enabler - Tech Lead (XC-AS/EAE-ADA-T)
> > Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> > GERMANY | www.bosch.com
> > Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> > jens.scheff...@de.bosch.com
> >
> > Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> > Aufsichtsratsvorsitzender: Prof. Dr. Stefan Asenkerschbaumer;
> > Geschäftsführung: Dr. Stefan Hartung, Dr. Christian Fischer, Dr. Markus
> > Forschner,
> > Stefan Grosch, Dr. Markus Heyn, Dr. Frank Meyer, Dr. Tanja Rückert
> >
> > -Original Message-
> > From: Vikram Koka 
> > Sent: Saturday, April 20, 2024 6:23 PM
> > To: dev@airflow.apache.org
> > Subject: Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs
> > strategic (Airflow 3) approach
> >
> > A wonderful and exciting Saturday morning discussion!
> > Thank you Jarek for bringing the offline conversations into the mailing
> > list.
> >
> > I completely agree on the necessity of Airflow 3.
> > I also agree that Gen AI is the trigger i.e. the answer to "Why now"?
> >
> > Having been thinking about this for a while from a strategic perspective,
> > as opposed to the tactical perspective of the bi-weekly and monthly
> > releases, I believe that our thinking as you articulated should have a
> > clear understanding of strategic vs. tactical, but I don't believe our
> > execution needs to necessarily be either or, but can actually be blended.
> >
> > With that said,  I believe that there are the following four buckets that
> > we should use as a framework for Airflow 3.
> >
> > 1. Gen AI / LLM support
> > 2. Airflow User Improvements
> > 3. Easy adoption of Airflow by new users
> > 4. Integration improvements / Provider maintainability
> >
> > Describing them in more detail below:
> > 1. Gen AI / LLM support
> > Reiterating the fact that this needs more work, I do believe this can be
> > incremental to Airflow. As Astronomer, we have worked on the LLM
> Providers
> > which we contributed to Airflow late last year. But clearly, there is so
> > much more to do, both from building awareness of the patterns / templates to
> > use, as well as patterns to support in Airflow to make these easier to
> use
> > and adopt.
> >
> > 2. Airflow User Improvements
> > Clearly features and improvements desired by the Community are important
> > to continue to work on to make Airflow more approachable. The top two
> > features which leap to mind for me here are:
> > 2.1 DAG Versioning - the most requested feature in the Airflow User
> Survey,
> > 2.2 Modern UI - also comes up a lot
> > 2.3 Different DAG distribution processes
> > 2.4 Different execution mechanisms
> > I know there are many more which I don't currently recall.
> >
> > 3. Airflow adoption
> > We have discussed this many times, but we absolutely need to make the
> > individual first-time adoption of Airflow better.
> > I think the most common term I recall here is the notion of "Airflow
> > Standalone", but whatever the term may be, an ultra quick, simpl

Re: [DISCUSS] DRAFT AIP-68 Extended Plugin Interface + AIP-69 Remote Executor

2024-04-20 Thread Jarek Potiuk
Hey Jens,

I looked at the AIPs when you created them and I very much like the
directions put there - but it also got me into a lot of thinking on the
future of Airflow and AIPs. See the thread I started
https://lists.apache.org/thread/3chvg9964zvh15mtrbl073f4oj3nlzp2  - about
Airflow 3.

I think in both cases (especially AIP-68 but also AIP-69) - it would
make a wealth of difference if we treat them in the context of Airflow 2,
or (maybe) in the context of Airflow 3 which might start from taking the
best of Airflow 2 and get rid of all the unnecessary baggage it has. In the
past, many similar efforts like AIP-69 stalled because they
were far too complex to implement while meeting the backwards
compatibility expectations of Airflow 2.

And I think it's the right time we come to terms with the future of
Airflow - whether/to what extent we want Airflow 3 to come, what level of
compatibility it should have, which assumptions should be dropped. I
personally have a feeling that AIP-69 would have been way easier to bring
as one of Airflow 3 "foundational" AIP that could define the "new" remote
architecture of Airflow rather than "plugin" to existing one. Dropping
Celery & K8s Options, leaving the Remote + Local variant of it as the only
ones, without the direct DB communication channel we have now and replacing
it with something else.

That would be my first comment and a question - should we get a bit
more clarity on what the "future" of Airflow is before we discuss details and
approach there?

J.

On Fri, Apr 19, 2024 at 10:06 PM Scheffler Jens (XC-AS/EAE-ADA-T)
 wrote:

> Dear Dev-Community,
>
> mainly triggered by the deadline for CFP for Summit I dropped two “brand
> new” AIP’s as ideas that are running in my head for a longer time. Note
> these are DRAFT versions - a first write-up of the solution concept - and are
> still lacking technical design and implementation.
>
> I’d kindly ask for feedback and review in Confluence cWiki (via Comments)
> – and also am seeking for people who like to join forces.
>
> AIP-68: Extended Plugin Interface for Custom Grid View Panels
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-68+Extended+Plugin+Interface+for+Custom+Grid+View+Panels
>
> Main motivation is that the UI has developed a lot recently and
> AIP-38 is near completion. But it is focusing on technical details and logs
> – and for most business users it is hard to read, missing the business
> perspective. I propose to extend the Plugin interface allowing
> customizations on various new levels such that customer specific business
> information can be embedded,
>
> AIP-69: Remote Executor
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-69+Remote+Executor
>
> Airflow can be deployed in cloud or on-prem and various options of
> deployment are possible. But it lacks the option to easily form a secure and
> lean distributed setup for cases where individual nodes are far far away
> from the core of deployment. This imposes problems in opening firewalls and
> might raise risk of security. Therefore I propose to add a “Remote
> Executor” such that workload can easily be distributed to remote locations –
> also with a chance that it is easier for cases where people want to
> distribute workload to Windows (yeah there are really people around who
> still have this).
>
> Looking forward to (constructive) feedback, discussion and opinions.
>
> Again, note: DRAFT means open to discussion, nothing fixed, nothing coded
> yet. Many implementation options possible – and in case of interest please
> join forces with me 
> Once the discussion calms down I’d call for a vote separately, as usual.
>
> Mit freundlichen Grüßen / Best regards
>
> Jens Scheffler
>
> Alliance: Enabler - Tech Lead (XC-AS/EAE-ADA-T)
> Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> GERMANY | www.bosch.com
> Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> jens.scheff...@de.bosch.com
>
>


Team id as dag_id prefix: (was [VOTE] AIP-67)

2024-04-20 Thread Jarek Potiuk
Hey Ash and Daniel (and others unconvinced),

I split the discussion here so that we do not clutter the VOTE thread.

I hope I can convince you that yes, it makes sense to approach it this way
and that you withdraw the veto, Ash. It needs a bit of context, plus the
separate discussion I started on Airflow 3 and the Strategic vs. Tactical
approach:
https://lists.apache.org/thread/3chvg9964zvh15mtrbl073f4oj3nlzp2

For this particular case:

I would not state it better than what Ash himself already made as the
comment in the document:

"I worry what this does to our database schemas! We'd need to add
tenant_id to *almost every single* PK wouldn't we?"

Yes. I worry about it too. And I want to avoid making huge changes to the
whole Airflow with that. But also I think it should not be something we
should carry forward in that form.

This is mostly a "deployment" option that makes it easy to namespace and
isolate DAGs - which is likely not going to be needed at all if we redesign
Airflow 3 (if we go that route, of course). So a future-evolving approach with
a complete schema etc. is really not a goal here. I treat it explicitly
more as a band-aid than a long-term approach.
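A tiny illustration (hypothetical code, not anything from the AIP) of the trade-off being debated here: a prefixed dag_id forces every consumer of the column to parse the delimiter, whereas a separate tenant_id column keeps the two values distinct.

```python
def split_team(dag_id: str) -> tuple:
    """Recover (team, dag_id) from a 'team:dag_id' string -- the parsing every
    consumer of an overloaded column would now have to perform."""
    team, sep, rest = dag_id.partition(":")
    return (team, rest) if sep else (None, dag_id)

# Works for well-formed namespaced ids...
print(split_team("finance:daily_report"))  # ('finance', 'daily_report')
# ...but a pre-existing dag_id that legitimately contains ':' is silently
# reinterpreted as namespaced -- the kind of column overloading being vetoed.
print(split_team("extract:load:transform"))  # ('extract', 'load:transform')

# The alternative: a separate tenant_id column needs no parsing at all.
row = {"tenant_id": "finance", "dag_id": "daily_report"}
```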

And I think what we are really looking at with AIP-67 (like with AIP-44 -
internal API and a number of others) is a tactical approach to add features
that users need for Airflow 2.

I see this AIP as a "tactical" approach, where we implement the minimum
things needed to support the use case for Airflow 2, but we would not want
this to be carried over to what's coming next as possibly Airflow 3 - and
this is why I think it's acceptable to make it a "band-aid" kind of
approach.

And I think we should look at this decision with that as a context.

J.


On Fri, Apr 19, 2024 at 8:10 PM Daniel Standish
>  wrote:
>
> > >
> > > It doesn’t affect my vote on the API, but I am very strongly against
> this
> > > one part of the AIP:
> > > > … dag_id are namespaced with `:` prefix.
> > > This specific part is getting an implementation/code veto from me. We
> > made
> > > the mistake of overloading one column to store multiple things in
> Airflow
> > > before, and I’ve dealt with the fallout in other apps in the past.
> Trust
> > > me: do. not. do. this.
> >
> > I agree with Ash's sentiment.  Is adding a tenant_id or something so
> > unpalatable?
> >
>


[HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-04-20 Thread Jarek Potiuk
Hello here,

I have been thinking a lot recently and discussing with some people and I
am more and more convinced it's about time we - as a community - should
start making changes considering "Airflow 2" the present and "Airflow 3" the future.


*TL;DR: I think we should seriously start work on Airflow 3 and decide what
it means for our AIPs  - to treat some of them as more "tactical" - things
that should go into Airflow 2 and some "strategic" ones - being
foundational for Airflow 3 - with different goals and criteria.*

A lot of us already think that way and a lot of us have already talked
about it for quite some time, so you should treat my mail mostly as a
little trigger "let's start publicly discussing what it might mean for us
and our community, and let's be clear about the target of the
initiatives we do".

Some might be surprised it comes from me as I've been often saying "no
Airflow 3 without a good reason" or "possibly we will have no Airflow 3",
but I think (and a number of people I spoke to have similar opinion) we
have plenty of reasons to make some bold moves now.

Over the last 4 years since Airflow 2 was out, a lot has changed and we
have a number of different needs that current Airflow 2 cannot **really**
do well:

- LLM/Gen-AI mainly as the important trigger
- Cloud Native is the "way to go"
- need to submit DAGs in other ways than dropping them to a shared DAG
folder.
- local testing and fast iteration on developing pipelines.
- ability to run tasks with workflow with "affinity" so that they can share
inputs/outputs in shared CPU/GPU memory
- ability to integrate seamlessly with other workflow engines - making
Airflow a "workflow of workflows"
- probably way more
- all that while keeping a lot of the strengths of Airflow 2 - such as
continuing to have the option of using the many thousands of operators with
90+ providers.

All those above - we could implement better if we get rid of a number of
the implicit or explicit luggage we have in Airflow 2. I think the last two
proposals from Jens: AIP-68 and AIP-69 reflect very much that - both would
have been much easier and more straightforward if we re-designed Airflow 3
basically at a drawing board, boldly dropping some Airflow 2
assumptions - and if we implemented core Airflow 3 by taking the best of what
we have now in Airflow 2, but generally dropping the luggage, in a new
framework.

And it won't be possible without breaking some fundamental assumptions and
making Airflow 3 quite heavily incompatible with Airflow 2

From "my" camp - dropping the need to have the 700+ dependencies for
Airflow + all providers in a single Python interpreter, and dropping the
dependency on Flask/Plugins/FAB, would be a huge win on its own. Not
mentioning being able to split provider's development and contribution from
airflow core (while keeping the development of providers as well and
contributions) - this has been highly requested.

And I think we have a lot of people in our community who would be able (and
would love) to do it - I think a number of us (including myself) are a bit
burned out and tired of just maintaining things in Airflow in a
backwards-compatible way and would jump at the opportunity to
rebuild Airflow.

But - we of course cannot forget about Airflow 2 users. We do not want to
"stop the world" for them. We want to keep fixing things and adding
incremental changes - and those things do not necessarily need to be super
"future-proof". They should help to "keep the lights on" for a while -
which means that in a number of cases it could be "band-aid". AIP-44
(internal-API), AIP-67 (multi-team) are more of those.

So - what I think we might want to do as a community:

* start working on Airflow 3 foundations (and decide what it means for our
users and developer community). Decide what to keep, what to drop, what to
redesign, and which assumptions to recreate.

* explicitly split the initiatives/AIPs we have to target Airflow 2 and
Airflow 3 and treat them a bit differently in terms of future-proofness

I would love to hear your thoughts on that (bracing for the storm of those).

J.


Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

2024-04-18 Thread Jarek Potiuk
The change is merged; rebasing should make maintainers' PRs use public
runners. They should be able to switch to self-hosted runners via the "use
self hosted runners" label. The `main` and `v2-9-test` runs should still use
self-hosted runners.
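The label-based switch could look roughly like this in a workflow file (a hypothetical fragment for illustration, not the actual Airflow workflow):

```yaml
jobs:
  tests:
    # Public runners by default; adding the "use self hosted runners" label
    # to a PR opts it back into the self-hosted fleet.
    runs-on: >-
      ${{ contains(github.event.pull_request.labels.*.name, 'use self hosted runners')
          && 'self-hosted' || 'ubuntu-22.04' }}
```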

I would love to hear back from the maintainers if that helps with their
experience.

On Thu, Apr 18, 2024 at 10:59 AM Jarek Potiuk  wrote:

> PR switching it here: https://github.com/apache/airflow/pull/39106 -
> sorry for the delay in following up on that one.
>
> J.
>
> On Fri, Apr 5, 2024 at 6:08 PM Wei Lee  wrote:
>
>> +1 for this. I have not yet had the chance to experience many job
>> failures, but it won’t harm us to test this out. Plus, it saves some of the
>> cost.
>>
>> Best,
>> Wei
>>
>> > On Apr 5, 2024, at 11:36 PM, Jarek Potiuk  wrote:
>> >
>> > Seeing no big "no's" - I will prepare and run the experiment - starting
>> > some time next week, after we get 2.9.0 out - I do not want to break
>> > anything there. In the meantime, preparatory PR to add "use self-hosted
>> > runners" label is out https://github.com/apache/airflow/pull/38779
>> >
>> > On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar
>> >  wrote:
>> >
>> >> +1 with trying this out. I agree with keeping the canary builds
>> >> self-hosted in order to validate the usage for the PRs.
>> >>
>> >> -- Rajesh
>> >>
>> >>
>> >> From: Jarek Potiuk 
>> >> Reply-To: "dev@airflow.apache.org" 
>> >> Date: Friday, April 5, 2024 at 8:36 AM
>> >> To: "dev@airflow.apache.org" 
>> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
>> >> self-hosted runners for commiter PRs
>> >>
>> >> Yeah. Valid concerns Hussein.
>> >>
>> >> And I am happy to share some more information on that. I did not want
>> to
>> >> put all of that in the original email, but I see that might be
>> interesting
>> >> for you and possibly others.
>> >>
>> >> I am closely following the numbers now. One of the reasons I am doing /
>> >> proposing it now is that finally (after almost 3 years of waiting) we
>> >> finally have access to some metrics that we can check. As of last week
>> I
>> >> got access to the ASF metrics (
>> >> https://issues.apache.org/jira/browse/INFRA-25662).
>> >>
>> >> I have access to "organisation" level information. Infra does not want
>> to
>> >> open it to everyone - even to every member -  but since I got very
>> active
>> >> and been helping with a number I got the access granted as an
>> exception.
>> >> Also I saw a small dashboard that INFRA is preparing to open to everyone
>> once
>> >> they sort out access, where we will be able to see the "per-project"
>> usage.
>> >>
>> >> Some stats that I can share (they asked not to share too much).
>> >>
>> >> From what I looked at I can tell that we are right now (the whole ASF
>> >> organisation) safely below the total capacity. With a large margin -
>> enough
>> >> to handle spikes, but of course the growth of usage is there and if
>> >> uncontrolled - we can again reach the same situation that triggered
>> getting
>> >> self-hosted runners a few years ago.
>> >>
>> >> Luckily - INFRA gets it under control this time (and metrics will
>> help).
>> >> In the last INFRA newsletter, they announced some limitations that will
>> >> apply to the projects (effective as of the end of April) - so once those
>> are
>> >> followed, we should be "safe" from being impacted by others (i.e.
>> >> noisy-neighbour effect). Some of the projects (

[VOTE] AIP-67 Multi-team deployment of Airflow components

2024-04-18 Thread Jarek Potiuk
Hello here.

I have not heard a lot of feedback after my last update, so let me
start a vote, hoping that the last changes proposed addressed most of the
concerns.

Just to recap. the proposal is here:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components

Summarizing the most recent changes - responding to comments and doubts
raised during the discussion:

* renamed it to be multi-team to clarify that this is the only case it
addresses
* splitting it into two phases: without and with the internal API of AIP-44
(also known as the GRPC server)
* simplifying the approach for Variables and Connections, where no changes
in Airflow will be needed to handle the DB updates.

This makes phase 1 simpler and not dependent on AIP-44.

The vote will last until Friday, the 26th of April 2024, noon CEST.

J.


Re: [VOTE] Airflow Providers prepared on April 16, 2024

2024-04-18 Thread Jarek Potiuk
+1 (binding): checked reproducibility, licences, signatures, checksums -
all look good.

On Thu, Apr 18, 2024 at 12:27 PM Pankaj Singh 
wrote:

> +1 (non-binding) tested my changes.
>
> Thanks
> Pankaj
>
>
>
> On Wed, Apr 17, 2024 at 12:24 PM Amogh Desai 
> wrote:
>
> > +1 non binding
> >
> > Tested a few example DAGs with cncf. Works fine.
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Wed, 17 Apr 2024 at 12:21 PM, Pankaj Koti
> >  wrote:
> >
> > > +1 (non-binding) Concurring with Wei.
> > >
> > >
> > > Best regards,
> > >
> > > *Pankaj Koti*
> > > Senior Software Engineer (Airflow OSS Engineering team)
> > > Location: Pune, Maharashtra, India
> > > Timezone: Indian Standard Time (IST)
> > >
> > >
> > > On Wed, Apr 17, 2024 at 9:01 AM Wei Lee  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > I tested my change and ran example DAGs with cncf, databricks
> providers
> > > > RCs, and it worked fine.
> > > >
> > > > Best,
> > > > Wei
> > > >
> > > > > On Apr 16, 2024, at 8:40 PM, Elad Kalif 
> wrote:
> > > > >
> > > > > Hey all,
> > > > >
> > > > > I have just cut the new wave Airflow Providers packages. This email
> > is
> > > > > calling a vote on the release,
> > > > > which will last for 72 hours - which means that it will end on
> April
> > > 19,
> > > > > 2024 12:40 PM UTC and until 3 binding +1 votes have been received.
> > > > >
> > > > >
> > > > > Consider this my (binding) +1.
> > > > >
> > > > > Airflow Providers are available at:
> > > > > https://dist.apache.org/repos/dist/dev/airflow/providers/
> > > > >
> > > > > *apache-airflow-providers--*.tar.gz* are the binary
> > > > > Python "sdist" release - they are also official "sources" for the
> > > > provider
> > > > > packages.
> > > > >
> > > > > *apache_airflow_providers_-*.whl are the binary
> > > > > Python "wheel" release.
> > > > >
> > > > > The test procedure for PMC members is described in
> > > > >
> > > >
> > >
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
> > > > >
> > > > > The test procedure for Contributors who would like to test this
> > RC
> > > is
> > > > > described in:
> > > > >
> > > >
> > >
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
> > > > >
> > > > >
> > > > > Public keys are available at:
> > > > > https://dist.apache.org/repos/dist/release/airflow/KEYS
> > > > >
> > > > > Please vote accordingly:
> > > > >
> > > > > [ ] +1 approve
> > > > > [ ] +0 no opinion
> > > > > [ ] -1 disapprove with the reason
> > > > >
> > > > > Only votes from PMC members are binding, but members of the
> community
> > > are
> > > > > encouraged to test the release and vote with "(non-binding)".
> > > > >
> > > > > Please note that the version number excludes the 'rcX' string.
> > > > > This will allow us to rename the artifact without modifying
> > > > > the artifact checksums when we actually release.
> > > > >
> > > > > The status of testing the providers by the community is kept here:
> > > > > https://github.com/apache/airflow/issues/39063
> > > > >
> > > > > The issue is also the easiest way to see important PRs included in
> > the
> > > RC
> > > > > candidates.
> > > > > Detailed changelog for the providers will be published in the
> > > > documentation
> > > > > after the
> > > > > RC candidates are released.
> > > > >
> > > > > You can find the RC packages in PyPI following these links:
> > > > >
> > > > >
> > > >
> > >
> >
> https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/8.1.1rc1/
> > > > >
> > https://pypi.org/project/apache-airflow-providers-databricks/6.3.0rc3/
> > > > > https://pypi.org/project/apache-airflow-providers-fab/1.0.4rc1/
> > > > >
> > > > > Cheers,
> > > > > Elad Kalif
> > > >
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > > For additional commands, e-mail: dev-h...@airflow.apache.org
> > > >
> > > >
> > >
> >
>


Re: [CALL FOR HELP] Help on Connexion 3 migration needed

2024-04-18 Thread Jarek Potiuk
Just to make it clear: I marked this PR as Draft - and I would really
appreciate some comments about the approach and direction (in light of
Connexion 2 being a dead end, blocking security-related upgrades and
generally holding us back).

Surely - there are things to be improved in this PR, and we already got
quite a deal of comments there (and yes, it's quite obvious it's not
mergeable in the current state - I made a number of comments and TODOs
there, but likely there will be more - there are already some).

And if we feel, we need a separate AIP for that one - we can definitely
think about writing and approving one - describing in more detail the
architectural changes introduced (which, as I understand it, mostly follow
the direction Python apps of this kind have long been heading in
with the WSGI => ASGI move). But mostly
it's a call for help from those who **could** help and move it into a "PROD
Ready" solution, if we think it's a good direction (and possibly help to
write the AIP).
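For readers less familiar with the WSGI => ASGI distinction mentioned above: an ASGI application is just an async callable taking (scope, receive, send), the interface that Connexion 3's stack builds on. A minimal, framework-free sketch (illustrative only, not the actual Airflow/Connexion code):

```python
import asyncio

async def app(scope, receive, send):
    """Minimal ASGI application: the async callable shape that replaces
    WSGI's synchronous (environ, start_response) interface."""
    assert scope["type"] == "http"
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})

async def drive():
    # Drive the app by hand; in production an ASGI server does this.
    sent = []
    async def send(message):
        sent.append(message)
    await app({"type": "http"}, None, send)
    return sent

messages = asyncio.run(drive())
print(messages[0]["status"], messages[1]["body"])  # 200 b'ok'
```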

We could form a small task force of people who would like to make it "good to
go" - people who know what they are doing with such a change and can bring
some more "testing" capacity - especially from those who host Airflow as
managed services.

Or maybe that could be one of the "internal" changes for the elusive
"Airflow 3" we've been talking about for such a long time but we were never
able to define what it is going to be? Dunno what others think :).
But it would be good to hear what others think, as this one is one of the
things that definitely holds us back.

J.



On Tue, Apr 16, 2024 at 10:36 PM Jarek Potiuk  wrote:

> I don't think an AIP is needed (because it's not really a user-facing change
> and generally the change is backwards compatible).
>
> But yeah - the call for help is mostly to see / review / discuss it when
> we really know the scope of the change after we have a PR that proves that
> yes - it can be done.
>
> It might indeed need voting when we get to the act of considering merging
> the change, but I am not sure if we need AIP describing it. We probably
> would not be able to write the AIP upfront - before attempting to migrate
> it to be honest.
>
> I think what I am really looking for is to achieve the level of confidence
> that might let us decide "yep it's good to go" (hopefully). The important
> thing here is that while the stack under-the-hood is changing a lot, the actual
> number of changes in airflow code (except the tests) is rather small -
> mostly initialization and wiring the stack together.
>
> We do not have to merge it now or even soon or maybe even never (though
> Connexion 2 is a dead-end and sooner or later we will have to replace it
> with **something** when it comes to serving our API).
> Connexion 3 seems to be a good - and natural - candidate.
>
> I think that with this change at this stage, we more or less know what it
> really means to migrate to Connexion 3 (especially having some comments
> from the Connexion maintainer, who is already looking in more detail at the
> comments and questions in the PR). This is the point where we can decide
> whether it is something we want to move forward with - especially if we get
> feedback from those who know more about the involved technology stack.
>
> That's why I mostly opened this thread :)
>
> J.
>
>
> On Tue, Apr 16, 2024 at 5:15 PM Ryan Hatter
>  wrote:
>
>> Does the scope of this PR warrant an AIP?
>>
>> On Tue, Apr 16, 2024 at 6:40 AM Jarek Potiuk  wrote:
>>
>> > Hello here,
>> >
>> > I have a kind request for help from maintainers (and other contributors
>> who
>> > are not maintainers) - on the Connexion 3 migration for Airflow. PR here
>> > (unfortunately - it's one big PR and cannot be split):
>> > https://github.com/apache/airflow/pull/39055.
>> >
>> > I would love some general comments on this - especially from those who
>> are
>> > more experts than me on those web frameworks - is it safe and ok to
>> > migrate, do we need to do some more testing on that? What do other
>> > maintainers think?
>> >
>> > This is not a "simple" change - it introduces a pretty fundamental
>> change
>> > in how our web app is handled - It changes from WSGI to ASGI interface
>> > (though we use gunicorn as WSGI). But it's also absolutely needed -
>> > because we already had some security issues connected with old
>> > dependencies (Werkzeug) - raised - and Connexion 3 migration seems to be
>> > the easiest way to get to the latest, maintained versions of the
>> > dependencies.
>> >
>> > That's why I'd really

Re: [DISCUSS] Consider disabling self-hosted runners for committer PRs

2024-04-18 Thread Jarek Potiuk
PR switching it here: https://github.com/apache/airflow/pull/39106 - sorry
for the delay in following up on that one.

J.

On Fri, Apr 5, 2024 at 6:08 PM Wei Lee  wrote:

> +1 for this. I have not yet had much chance to experience job failures,
> but it won't hurt to test this out. Plus, it saves some of the cost.
>
> Best,
> Wei
>
> > On Apr 5, 2024, at 11:36 PM, Jarek Potiuk  wrote:
> >
> > Seeing no big "no's" - I will prepare and run the experiment - starting
> > some time next week, after we get 2.9.0 out - I do not want to break
> > anything there. In the meantime, preparatory PR to add "use self-hosted
> > runners" label is out https://github.com/apache/airflow/pull/38779
> >
> > On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar
> >  wrote:
> >
> >> +1 with trying this out. I agree with keeping the canary builds
> >> self-hosted in order to validate the usage for the PRs.
> >>
> >> -- Rajesh
> >>
> >>
> >> From: Jarek Potiuk 
> >> Reply-To: "dev@airflow.apache.org" 
> >> Date: Friday, April 5, 2024 at 8:36 AM
> >> To: "dev@airflow.apache.org" 
> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> >> self-hosted runners for committer PRs
> >>
> >>
> >>
> >> Yeah. Valid concerns Hussein.
> >>
> >> And I am happy to share some more information on that. I did not want to
> >> put all of that in the original email, but I see that might be
> interesting
> >> for you and possibly others.
> >>
> >> I am closely following the numbers now. One of the reasons I am doing /
> >> proposing this now is that, finally (after almost 3 years of waiting), we
> >> have access to some metrics that we can check. As of last week I have
> >> access to the ASF metrics (
> >> https://issues.apache.org/jira/browse/INFRA-25662).
> >>
> >> I have access to "organisation"-level information. Infra does not want to
> >> open it to everyone - even to every member - but since I became very
> >> active and have been helping with a number of issues, I was granted
> >> access as an exception. I also saw a small dashboard that INFRA is
> >> preparing to open to everyone once they sort out the access, where we
> >> will be able to see "per-project" usage.
> >>
> >> Some stats that I can share (they asked me not to share too much):
> >>
> >> From what I looked at, I can tell that we (the whole ASF organisation)
> >> are right now safely below the total capacity - with a large margin,
> >> enough to handle spikes. But of course usage keeps growing, and if
> >> uncontrolled, we could again reach the same situation that triggered
> >> getting self-hosted runners a few years ago.
> >>
> >> Luckily, INFRA is getting it under control this time (and the metrics
> >> will help). In the last INFRA newsletter, they announced some limitations
> >> that will apply to projects (effective at the end of April) - so once
> >> those are followed, we should be "safe" from being impacted by others
> >> (i.e. the noisy-neighbour effect). Some projects (not Airflow!) have been
> >> exceeding those limits so far and will be capped - they will need to
> >> optimize their builds eventually.
> >>
> >> Those are the rules:
> >>
> >> * All workflows MUST have a job concurrency level less than or equal to
> >> 20. This means a workflow cannot have more than 20 jobs running at the
> same
> >> time across all matrices.
> >> * All workflows SHOULD have a job concurrency level less than or equal
> to
> >> 15. Just because 20 is the max, doesn't mean you should strive for 20.
> >> * The average number of minutes a project uses per calendar week MUST
> NOT
> >> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200
> >> hours).

Re: [CALL FOR HELP] Help on Connexion 3 migration needed

2024-04-16 Thread Jarek Potiuk
I don't think an AIP is needed (because it's not really a user-facing change
and the change is generally backwards compatible).

But yeah - the call for help is mostly to see / review / discuss it when we
really know the scope of the change after we have a PR that proves that yes
- it can be done.

It might indeed need voting when we get to the act of considering merging
the change, but I am not sure if we need AIP describing it. We probably
would not be able to write the AIP upfront - before attempting to migrate
it to be honest.

I think what I am really looking for is to achieve the level of confidence
that might let us decide "yep it's good to go" (hopefully). The important
thing here is that while the stack under-the-hood is changing a lot, the actual
number of changes in airflow code (except the tests) is rather small -
mostly initialization and wiring the stack together.

We do not have to merge it now or even soon or maybe even never (though
Connexion 2 is a dead-end and sooner or later we will have to replace it
with **something** when it comes to serving our API).
Connexion 3 seems to be a good - and natural - candidate.

I think that with this change at this stage, we more or less know what it
really means to migrate to Connexion 3 (especially having some comments from
the Connexion maintainer, who is already looking in more detail at the
comments and questions in the PR). This is the point where we can decide
whether it is something we want to move forward with - especially if we get
feedback from those who know more about the involved technology stack.

That's why I mostly opened this thread :)

J.


On Tue, Apr 16, 2024 at 5:15 PM Ryan Hatter
 wrote:

> Does the scope of this PR warrant an AIP?
>
> On Tue, Apr 16, 2024 at 6:40 AM Jarek Potiuk  wrote:
>
> > Hello here,
> >
> > I have a kind request for help from maintainers (and other contributors
> who
> > are not maintainers) - on the Connexion 3 migration for Airflow. PR here
> > (unfortunately - it's one big PR and cannot be split):
> > https://github.com/apache/airflow/pull/39055.
> >
> > I would love some general comments on this - especially from those who
> are
> > more experts than me on those web frameworks - is it safe and ok to
> > migrate, do we need to do some more testing on that? What do other
> > maintainers think?
> >
> > This is not a "simple" change - it introduces a pretty fundamental change
> > in how our web app is handled - It changes from WSGI to ASGI interface
> > (though we use gunicorn as WSGI). But it's also absolutely needed -
> > because we already had some security issues connected with old
> > dependencies (Werkzeug) - raised - and Connexion 3 migration seems to be
> > the easiest way to get to the latest, maintained versions of the
> > dependencies.
> >
> > That's why I'd really like a few more maintainers - and people from the
> > Astronomer, Google and AWS to take it for a spin and help to test that
> > change and say "yep. It looks good, we can merge it".  I would especially
> > appreciate some more "scale" testing on it. It seems that performance and
> > resource usage is not affected and ASGI interface and uvicorn should
> nicely
> > replace all the different worker types we could have for gunicorn - but I
> > would love to have confirmation for that.
> >
> > The PR has been started by Vlada and Maks from the Google team - and
> > with the help of Sudipto and Satoshi - two interns from Major League
> > Hacking - supported by Royal Bank of Canada - finally we have a stable,
> > working version and green PR.  Airflow webserver + API seems to work
> well,
> > stable (and generally back-compatible) on both - development (local +
> > breeze) and PROD image.
> >
> > I took a mentorship and leading role on it - but personally I have been
> > learning on the go about WSGI/ASGI and all changes needed -  I am not an
> > expert at all in those. We followed the directions from Connexion's
> > maintainer Robb Snyders - and I asked him to help and comment on the PR
> in
> > a number of places - but  that's why I need more help and experts' eyes
> and
> > hands to be quite sure it can be safely merged.
> >
> > I extracted it and squashed more than 100 commits on it into a single one
> > to make it easier to start new conversations.
> >
> > Once again PR here: https://github.com/apache/airflow/pull/39055
> >
> > Also - we need to decide when is the best time to merge the PR - it does
> > not introduce a lot of changes in the code of the app, but it changes a
> lot
> > of test code to make it compatible with the Starlette test client - we can
> > continue rebasing it and fixing new changes for a short while - but I
> think
> > the sooner we migrate it - the better - it will give more time for
> testing
> > in the future MINOR airflow version.
> >
> > J.
> >
>


[CALL FOR HELP] Help on Connexion 3 migration needed

2024-04-16 Thread Jarek Potiuk
Hello here,

I have a kind request for help from maintainers (and other contributors who
are not maintainers) - on the Connexion 3 migration for Airflow. PR here
(unfortunately - it's one big PR and cannot be split):
https://github.com/apache/airflow/pull/39055.

I would love some general comments on this - especially from those who are
more expert than I am in these web frameworks: is it safe and OK to migrate?
Do we need to do more testing on it? What do other maintainers think?

This is not a "simple" change - it introduces a pretty fundamental change
in how our web app is handled: it changes the interface from WSGI to ASGI
(though we currently use gunicorn as a WSGI server). But it's also
absolutely needed - we have already had security issues raised that were
connected with old dependencies (Werkzeug) - and the Connexion 3 migration
seems to be the easiest way to get to the latest, maintained versions of
the dependencies.
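
For readers less familiar with the two interfaces, here is a minimal,
self-contained sketch (not Airflow's actual code) contrasting a WSGI callable
with its ASGI counterpart - the shape of these callables is what the migration
fundamentally changes:

```python
# Illustrative WSGI vs ASGI callables (not Airflow code).

def wsgi_app(environ, start_response):
    # WSGI: a synchronous callable; each request occupies a worker
    # thread/process for its whole duration.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from WSGI"]


async def asgi_app(scope, receive, send):
    # ASGI: an async callable; one event loop can multiplex many
    # requests (and supports websockets/streaming natively).
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
        }
    )
    await send({"type": "http.response.body", "body": b"hello from ASGI"})
```

A server like gunicorn speaks the first protocol; uvicorn speaks the second,
which is why the serving stack changes along with Connexion.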

That's why I'd really like a few more maintainers - and people from
Astronomer, Google and AWS - to take it for a spin, help to test the
change, and say "yep, it looks good, we can merge it". I would especially
appreciate some more "scale" testing of it. It seems that performance and
resource usage are not affected, and the ASGI interface with uvicorn should
nicely replace all the different worker types we could have for gunicorn -
but I would love to have confirmation of that.

The PR was started by Vlada and Maks from the Google team, and - with the
help of Sudipto and Satoshi, two interns from Major League Hacking,
supported by Royal Bank of Canada - we finally have a stable, working
version and a green PR. The Airflow webserver + API seems to work well and
stably (and generally backwards-compatibly) on both the development (local
+ breeze) and PROD images.

I took a mentorship and leading role on it - but personally I have been
learning on the go about WSGI/ASGI and all the changes needed - I am not an
expert in those at all. We followed the directions from Connexion's
maintainer Robb Snyders - and I asked him to help and comment on the PR in
a number of places - but that's why I need more help and experts' eyes and
hands to be quite sure it can be safely merged.

I extracted it and squashed more than 100 commits on it into a single one
to make it easier to start new conversations.

Once again PR here: https://github.com/apache/airflow/pull/39055

Also - we need to decide when is the best time to merge the PR. It does
not introduce a lot of changes in the code of the app, but it changes a lot
of test code to make it compatible with the Starlette test client - we can
continue rebasing it and adapting to new changes for a short while, but I
think the sooner we migrate it, the better - it will give more time for
testing in a future MINOR Airflow version.

J.


Re: [PROPOSAL] Keep >= LATEST MINOR for all providers using common.sql provider

2024-04-15 Thread Jarek Potiuk
Thanks to everyone who chimed in.

Yeah. There are certain dangers in that approach. I think for now we have
not seen much of a problem with the current approach, but we might want to
address it differently - probably not by running a complete set of tests
for older versions (that's not really feasible), most likely by just
running import checks, as we currently do in the providers' compatibility
checks.
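
An "import check" of that kind can be sketched in a few lines - the function
below is a hypothetical illustration (not the actual compatibility-check
code): it walks a package and imports every module, catching import-time
incompatibilities without running a full test suite.

```python
# Hypothetical sketch of an "import check": walk a package and try to
# import every module under it, collecting failures instead of aborting.
import importlib
import pkgutil


def import_errors(package_name):
    """Try to import every module under a package; map module name -> error."""
    errors = {}
    package = importlib.import_module(package_name)
    for info in pkgutil.walk_packages(package.__path__, prefix=package_name + "."):
        try:
            importlib.import_module(info.name)
        except Exception as exc:  # record the failure, keep walking
            errors[info.name] = f"{type(exc).__name__}: {exc}"
    return errors


# e.g. against the stdlib 'json' package - everything should import cleanly:
print(import_errors("json"))  # -> {}
```

Run against a provider package installed alongside an older common.sql, any
module importing a symbol that does not exist yet would show up in the result.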

Let me park this one for a while; we can possibly come back to it when/if
we do the `--lowest` resolution check for dependencies with UV - which is
something that should be addressed separately, but it fits the same general
area: what the right lower bounds for the dependencies should be.

J.


On Mon, Apr 15, 2024 at 8:41 AM Aritra Basu 
wrote:

> I tend to agree with Wei here - if process can fix the issue, maybe it
> shouldn't go into code.
>
> --
> Regards,
> Aritra Basu
>
> On Mon, Apr 15, 2024, 9:42 AM Amogh Desai 
> wrote:
>
> > I agree with the points made by Andrey here.
> >
> > > End users use amazon provider and google provider, and do not use
> > common.sql and both of them have mandatory common.sql as dependency.
> >
> > On the contrary, would it be better to remove common.sql as mandatory
> > dependencies for
> > providers that do not use it? That way we would avoid one of the bigger
> > problems which is
> > maintaining the providers that have common.sql as dependency but do not
> > need it as and when we
> > someday deprecate common.sql?
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Sun, Apr 14, 2024 at 2:20 PM Wei Lee  wrote:
> >
> > > I feel like this is the responsibility of the provider contributor 🤔
> > > Maybe we could check whether the PR contains common.sql usage and
> raise a
> > > warning?
> > >
> > > Best,
> > > Wei
> > >
> > > > On Apr 12, 2024, at 12:31 AM, Andrey Anshin <
> andrey.ans...@taragol.is>
> > > wrote:
> > > >
> > > > There are some drawbacks I see here: it would force upgrading other
> > > > providers to the latest version.
> > > > Some scenarios off the top of my head:
> > > >
> > > > End users use the amazon and google providers and do not use
> > > > common.sql directly, yet both providers have common.sql as a mandatory
> > > > dependency. If everything works fine there is no problem, but there
> > > > could be a situation where a new release of one of the providers is
> > > > major or contains bugs, and there is no way to downgrade one of them
> > > > while keeping a new version of the other without breaking dependencies.
> > > >
> > > > Another case: if we need to exclude the common.sql provider from a
> > > > provider release wave, then we also need to exclude all providers
> > > > which depend on common.sql, even if the problem does not affect them
> > > > directly - e.g. providers that are not SQL-only, such as amazon,
> > > > google and microsoft.azure.
> > > >
> > > > If it is not a big deal, I have no objections to adding this, because
> > > > I do not have a better solution which could be implemented "Here and
> > > > Now".
> > > >
> > > > It could be resolved if we ran tests against released versions of
> > > > common.sql, but after a brief investigation, running tests against
> > > > installed versions rather than the main version would require
> > > > changing quite a few things, which might take time.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, 11 Apr 2024 at 12:41, Jarek Potiuk   > > ja...@potiuk.com>> wrote:
> > > >
> > > >> Hello here,
> > > >>
> > > >> I have a proposal to add a general policy that all our providers
> > > depending
> > > >> on the common.sql provider always use >= LATEST_MINOR version of the
> > > provider.
> > > >>
> > > >> For example, following the change here
> > > >> https://github.com/apache/airflow/pull/38707  by David we will
> update
> > > all
> > > >> sql providers to have: airflow-providers-common-sql >= 1.12. We
> could
> > of
> > > >> course automate that with pre-commit so that we do not have to
> > remember
> > > >> about it.
> > > >>
> > > >> Why ?
> > > >>
> > > >> B

Re: [VOTE] Airflow Providers prepared on April 13, 2024

2024-04-15 Thread Jarek Potiuk
+1 (binding) for yandex: checked signatures, licences, checksums,
reproducibility.

On Sun, Apr 14, 2024 at 2:51 PM Elad Kalif  wrote:

> databricks provider is excluded from RC2
> let the vote continue only for yandex provider
>
> On Sun, Apr 14, 2024 at 8:10 AM Pankaj Koti
>  wrote:
>
> > -1 (non-binding) for the Databricks provider. We would need to also
> include
> > PR https://github.com/apache/airflow/pull/38962 for the included PRs in
> > the
> > RC to work well.
> > Also added a
> > https://github.com/apache/airflow/issues/38997#issuecomment-2053904837
> on
> > the status issue.
> >
> >
> > On Sun, 14 Apr 2024, 02:07 Elad Kalif,  wrote:
> >
> > > Hey all,
> > >
> > > I have just cut *RC2* version for Airflow Providers packages. This
> email
> > is
> > > calling a vote on the release,
> > > which will last for *24 hours* - which means that it will end on April
> > > 14, 2024 at 20:35 UTC and until 3 binding +1 votes have been received.
> > > This is according to policy accepted in
> > > https://lists.apache.org/thread/cv194w1fqqykrhswhmm54zy9gnnv6kgm
> > >
> > >
> > > Consider this my (binding) +1.
> > >
> > > Airflow Providers are available at:
> > > https://dist.apache.org/repos/dist/dev/airflow/providers/
> > >
> > > *apache-airflow-providers--*.tar.gz* are the binary
> > >  Python "sdist" release - they are also official "sources" for the
> > provider
> > > packages.
> > >
> > > *apache_airflow_providers_-*.whl are the binary
> > >  Python "wheel" release.
> > >
> > > The test procedure for PMC members is described in
> > >
> > >
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
> > >
> > > The test procedure for Contributors who would like to test this RC
> is
> > > described in:
> > >
> > >
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
> > >
> > >
> > > Public keys are available at:
> > > https://dist.apache.org/repos/dist/release/airflow/KEYS
> > >
> > > Please vote accordingly:
> > >
> > > [ ] +1 approve
> > > [ ] +0 no opinion
> > > [ ] -1 disapprove with the reason
> > >
> > > Only votes from PMC members are binding, but members of the community
> are
> > > encouraged to test the release and vote with "(non-binding)".
> > >
> > > Please note that the version number excludes the 'rcX' string.
> > > This will allow us to rename the artifact without modifying
> > > the artifact checksums when we actually release.
> > >
> > > The status of testing the providers by the community is kept here:
> > > https://github.com/apache/airflow/issues/38997
> > >
> > > The issue is also the easiest way to see important PRs included in the
> RC
> > > candidates.
> > > Detailed changelog for the providers will be published in the
> > documentation
> > > after the
> > > RC candidates are released.
> > >
> > > You can find the RC packages in PyPI following these links:
> > >
> > > https://pypi.org/project/apache-airflow-providers-databricks/6.3.0rc2/
> > > https://pypi.org/project/apache-airflow-providers-yandex/3.10.0rc2/
> > >
> > > Cheers,
> > > Elad Kalif
> > >
> >
>


[DISCUSS] Redis licencing changes impact

2024-04-11 Thread Jarek Potiuk
Hello here,

I've raised the discussion on private@, and it seems there is nothing
private or controversial there, so I am bringing the discussion to the
devlist where it belongs.
belongs.

You might want to be aware that Redis announced licensing changes that make
future Redis 7.4+ releases unsuitable as a "mandatory" dependency of
Airflow [1]. This is a very similar move to the ones MongoDB, Elasticsearch
and Terraform made.

The RSAL licence is explicitly mentioned in "Category X" [2] of the ASF's
3rd-party licence policy as not acceptable for a "mandatory" dependency -
but it is still allowed as an optional one [3].

The Redis client is MIT-licensed and there are no plans to change that - so
technically speaking we do not have to limit it in any way.

My personal assessment is that we are not really affected - even with our
Celery Executor, RabbitMQ is still a viable alternative to Redis (not to
mention the Local and Kubernetes executors), so Redis is indeed an optional
feature. In the future, someone could contribute support for Apache Qpid
[4] (which is supported by Kombu); an even easier contribution would be to
add Valkey [5] as an alternative.
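
As a concrete illustration of why Redis is optional here: switching the
Celery executor's broker away from Redis is a configuration change. A sketch
follows - the host names, credentials and vhost are placeholders, not a
recommended production setup:

```shell
# Point the Celery executor at RabbitMQ instead of Redis.
# The connection details below are placeholders.
export AIRFLOW__CELERY__BROKER_URL="amqp://guest:guest@rabbitmq:5672//"
# Task results can be kept in the metadata database instead of Redis:
export AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://airflow:airflow@postgres/airflow"
```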

In the meantime I created https://github.com/apache/airflow/pull/38928 to
limit the Redis image to 7.2-bookworm.

Personally, I'd also love to see someone contribute direct Valkey support
(that would be great, I think) - but other than that, I don't think there
is much we have to do now.

But maybe others have different opinions and proposals (which I would love
to follow up on)?

J.


[1] https://redis.io/blog/redis-adopts-dual-source-available-licensing/
[2] https://www.apache.org/legal/resolved.html#category-x
[3] https://www.apache.org/legal/resolved.html#optional
[4] https://qpid.apache.org
[5]
https://www.linuxfoundation.org/press/linux-foundation-launches-open-source-valkey-community


Re: [VOTE] Airflow Providers prepared on April 10, 2024

2024-04-11 Thread Jarek Potiuk
+1 (binding). Checked all my changes, including the celery provider bug
that prevented 3.6.1 from running on Airflow 2.7.* (celery 3.6.2rc1 now
works just fine for Airflow 2.7). Checked reproducibility, licences,
signatures, checksums. All good.

On Wed, Apr 10, 2024 at 8:11 PM Vincent Beck  wrote:

> +1 non binding. All AWS system tests are running successfully against
> apache-airflow-providers-amazon==8.20.0rc1 with the exception of
> example_bedrock that is failing due to a bug in the test itself (fix here:
> https://github.com/apache/airflow/pull/38887). You can see the results
> here:
> https://aws-mwaa.github.io/#/open-source/system-tests/version/3d804351aa7a875dfdba824c2b27300cc5ce9e92_8.20.0rc1.html
>
> On 2024/04/10 16:37:12 Elad Kalif wrote:
> > Hey all,
> >
> > I have just cut the new wave Airflow Providers packages. This email is
> > calling a vote on the release,
> > which will last for 72 hours - which means that it will end on April 13,
> > 2024 at 16:35 UTC and until 3 binding +1 votes have been received.
> >
> >
> > Consider this my (binding) +1.
> >
> > Airflow Providers are available at:
> > https://dist.apache.org/repos/dist/dev/airflow/providers/
> >
> > *apache-airflow-providers--*.tar.gz* are the binary
> >  Python "sdist" release - they are also official "sources" for the
> provider
> > packages.
> >
> > *apache_airflow_providers_-*.whl are the binary
> >  Python "wheel" release.
> >
> > The test procedure for PMC members is described in
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
> >
> > The test procedure for Contributors who would like to test this RC is
> > described in:
> >
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
> >
> >
> > Public keys are available at:
> > https://dist.apache.org/repos/dist/release/airflow/KEYS
> >
> > Please vote accordingly:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove with the reason
> >
> > Only votes from PMC members are binding, but members of the community are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > Please note that the version number excludes the 'rcX' string.
> > This will allow us to rename the artifact without modifying
> > the artifact checksums when we actually release.
> >
> > The status of testing the providers by the community is kept here:
> > https://github.com/apache/airflow/issues/38904
> >
> > The issue is also the easiest way to see important PRs included in the RC
> > candidates.
> > Detailed changelog for the providers will be published in the
> documentation
> > after the
> > RC candidates are released.
> >
> > You can find the RC packages in PyPI following these links:
> >
> > https://pypi.org/project/apache-airflow-providers-airbyte/3.7.0rc1/
> > https://pypi.org/project/apache-airflow-providers-alibaba/2.7.3rc1/
> > https://pypi.org/project/apache-airflow-providers-amazon/8.20.0rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-beam/5.6.3rc1/
> >
> https://pypi.org/project/apache-airflow-providers-apache-cassandra/3.4.2rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-hive/8.0.0rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-spark/4.7.2rc1/
> > https://pypi.org/project/apache-airflow-providers-celery/3.6.2rc1/
> >
> https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/8.1.0rc1/
> > https://pypi.org/project/apache-airflow-providers-cohere/1.1.3rc1/
> > https://pypi.org/project/apache-airflow-providers-common-io/1.3.1rc1/
> > https://pypi.org/project/apache-airflow-providers-common-sql/1.12.0rc1/
> > https://pypi.org/project/apache-airflow-providers-databricks/6.3.0rc1/
> > https://pypi.org/project/apache-airflow-providers-dbt-cloud/3.7.1rc1/
> > https://pypi.org/project/apache-airflow-providers-docker/3.10.0rc2/
> >
> https://pypi.org/project/apache-airflow-providers-elasticsearch/5.3.4rc1/
> > https://pypi.org/project/apache-airflow-providers-fab/1.0.3rc1/
> > https://pypi.org/project/apache-airflow-providers-ftp/3.8.0rc2/
> > https://pypi.org/project/apache-airflow-providers-google/10.17.0rc1/
> > https://pypi.org/project/apache-airflow-providers-http/4.10.1rc1/
> >
> https://pypi.org/project/apache-airflow-providers-microsoft-azure/10.0.0rc1/
> >
> https://pypi.org/project/apache-airflow-providers-microsoft-psrp/2.6.1rc1/
> > https://pypi.org/project/apache-airflow-providers-odbc/4.5.0rc1/
> > https://pypi.org/project/apache-airflow-providers-openlineage/1.7.0rc1/
> > https://pypi.org/project/apache-airflow-providers-papermill/3.6.2rc1/
> > https://pypi.org/project/apache-airflow-providers-redis/3.6.1rc1/
> > https://pypi.org/project/apache-airflow-providers-samba/4.6.0rc2/
> > https://pypi.org/project/apache-airflow-providers-sftp/4.9.1rc1/
> > https://pypi.org/project/apache-airflow-providers-slack/8.6.2rc1/
> > 

[PROPOSAL] Keep >= LATEST MINOR for all providers using common.sql provider

2024-04-11 Thread Jarek Potiuk
Hello here,

I have a proposal to add a general policy that all our providers depending
on the common.sql provider always use >= LATEST_MINOR version of that
provider.

For example, following the change in
https://github.com/apache/airflow/pull/38707 by David, we will update all
SQL providers to have: apache-airflow-providers-common-sql >= 1.12. We
could of course automate that with pre-commit so that we do not have to
remember about it.

Why ?

Because it's very easy for a provider to accidentally use a new feature of
common.sql without bumping the version. The current situation is as shown
in the attached image (thanks, David), but we cannot be certain that the
"min-versions" there are "good".

A bit more context:

Generally speaking, common.sql **should** be backwards compatible -
always. We should not make any backwards-incompatible changes in its API
(which is for now captured here:
https://github.com/apache/airflow/blob/main/airflow/providers/common/sql/hooks/sql.pyi
). And from time to time we add new things to common.sql that providers
might start using.

Example:

From the https://lists.apache.org/thread/lzcpllwcgo7pc8g18l3b905kn8f9k4w8
thread, the new "suppress_and_warn" helper is going to be added.

Currently we have no good mechanism to verify whether the min version in
provider dependencies is really "good". For example, if we add
`suppress_and_warn` today and someone starts using it in the "presto" (for
example) provider tomorrow without bumping common-sql to >= 1.12, it will
fail with 1.11 or earlier installed.

On the other hand, all the tests we run in `main` use the "latest" version
of `common.sql`, and all the constraints we produce also use the latest
released version - so we can be rather sure that the new providers actually
work well with the latest `common.sql` version. If there is a compatibility
bug (it has happened in the past), it should be fixed in a new patchlevel
of `common.sql`.

So in general it should be "safe" to require >= LATEST MINOR of common.sql
in all providers.

Should we do (and automate) it ? Any other comments / proposals maybe ?

Here is the current state of dependencies:


[image: image.png]
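
The pre-commit automation mentioned above could boil down to a simple
check. The sketch below is hypothetical (the names and data are invented
for illustration; this is not the actual pre-commit hook): given the latest
released common.sql minor and each provider's declared pin, it flags
providers whose lower bound lags behind.

```python
# Hypothetical sketch of the proposed pre-commit check (invented names/data).
import re

LATEST_COMMON_SQL_MINOR = (1, 12)  # would be derived from the latest release


def min_pin(requirement):
    """Extract the (major, minor) lower bound from a '>=' specifier."""
    match = re.search(r">=\s*(\d+)\.(\d+)", requirement)
    return (int(match.group(1)), int(match.group(2))) if match else None


def outdated_providers(deps):
    """Return providers whose pin is missing or below the latest minor."""
    return [
        name
        for name, requirement in deps.items()
        if (pin := min_pin(requirement)) is None or pin < LATEST_COMMON_SQL_MINOR
    ]


deps = {
    "presto": "apache-airflow-providers-common-sql>=1.11",
    "postgres": "apache-airflow-providers-common-sql>=1.12",
}
print(outdated_providers(deps))  # -> ['presto']
```

A real hook would read each provider's dependency metadata instead of a
hard-coded dict, and could rewrite the pin rather than just flag it.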


Re: [DISCUSS] Common.util provider?

2024-04-11 Thread Jarek Potiuk
Any other ideas or suggestions here? Can someone explain how the "polyfill"
approach would look, maybe? How do we imagine this working?

Just to continue this discussion - another example.

Small thing that David wanted to add for changes in some sql providers:

@contextmanager
def suppress_and_warn(*exceptions: type[BaseException]):
    """Context manager that suppresses the given exceptions and logs a
    warning message."""
    try:
        yield
    except exceptions as e:
        warnings.warn(
            f"Exception suppressed: {e}\n{traceback.format_exc()}",
            category=UserWarning,
        )


https://github.com/apache/airflow/pull/38707/files#diff-6e1b2f961cb951d05d66d2d814ef5f6d8f8bf8f43c40fb5d40e27a031fed8dd7R115
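
For illustration, here is a self-contained version of that helper (with the
imports it needs), plus a hypothetical usage - the OSError scenario below is
invented, e.g. a best-effort cleanup step that must not fail the surrounding
task:

```python
import traceback
import warnings
from contextlib import contextmanager


@contextmanager
def suppress_and_warn(*exceptions):
    """Suppress the given exceptions, emitting a UserWarning instead."""
    try:
        yield
    except exceptions as e:
        warnings.warn(
            f"Exception suppressed: {e}\n{traceback.format_exc()}",
            category=UserWarning,
        )


# A best-effort cleanup step that must not fail the surrounding task:
with suppress_and_warn(OSError):
    raise OSError("temp file already gone")
print("execution continues")  # -> execution continues
```

Any exception type not listed still propagates, so genuine errors are not
silently swallowed.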

This is a small thing - but adding it in `airflow` is problematic - because
it would only be released in 2.10, so we could not use it in providers if we
did.
Currently - since it is used in SQL providers, I suggested using
`common.sql` for that code (and adding a `common.sql >= 1.12` dependency for
those providers that use it). And I will write a separate email about a
proposed versioning approach there.

Do we have a good proposal on how we can solve similar things in the
future?
Do we want it at all? It has some challenges - yes, it DRYs the code, but it
also introduces coupling.

J.
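One generic way to sketch the "import from core, fall back to a copied/vendored module" idea discussed in the quoted thread below is a small loader. All module names here are hypothetical (there is no `hypothetical_airflow_core` package); stdlib `math` stands in for the "vendored" fallback:

```python
import importlib


def load_helper(name: str, preferred: str, fallback: str):
    """Polyfill-style loader: try the preferred module first, fall back otherwise."""
    for module_name in (preferred, fallback):
        try:
            return getattr(importlib.import_module(module_name), name)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"{name!r} not found in {preferred!r} or {fallback!r}")


# Demo: the "core" location does not exist here, so we fall back to "math",
# which stands in for a vendored copy bundled with the provider.
sqrt = load_helper("sqrt", "hypothetical_airflow_core.utils", "math")
print(sqrt(9.0))  # 3.0
```

The trade-off is the same one raised in the thread: the fallback copy must be kept in sync with the core version, which is the coupling/DRY tension being discussed.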


On Sun, Mar 10, 2024 at 6:21 PM Jarek Potiuk  wrote:

> Coming back to it - what about the "polyfill" :)? What's different vs the
> "common.sql" way of doing it? How do we think it can work?
>
> On Thu, Feb 22, 2024 at 1:58 PM Jarek Potiuk  wrote:
>
>> > The symbolic link approach seems to disregard all the external
>> providers, unless I misunderstand it.
>>
>> Not really. It just does not make it easy for the external providers to
>> use it "fast". They can still - if they want - manually copy those
>> utils from the latest version of Airflow. Almost by
>> definition, those will be small, independent modules that can be simply
>> copied as needed by whoever releases external providers - and they are also
>> free to copy any older version if they want. That is a nice feature that
>> makes them fully decoupled from the version of Airflow they are installed
>> in (same as community providers). Or - if they want they can just import
>> them from "airflow.provider_utils" - but then they have to add >= Airflow
>> 2.9 if that util appeared in Airflow 2.9 (which is the main reason we want
>> to use symbolic links - because due to our policies and promises, we do not
>> want community providers to depend on the latest version of Airflow in the
>> vast majority of cases).
>>
>> So this approach is also fully usable by external providers, but it
>> requires some manual effort to copy the modules to their providers.
>>
>> > I like the polyfill idea. A backport provider that brings new
>> interfaces to providers without the actual functionalities.
>>
>> I would love to hear more about this, I think the "common.util" was
>> exactly the kind of polyfill approach (with its own versioning
>> complexities) but maybe I do not understand how such a polyfill provider
>> would work. Say we want to add a new "urlparse" method usable for all
>> providers. Could you explain how it would work - say:
>>
>> * we add "urlparse" in Airflow 2.9
>> * some provider wants to use it in Airflow 2.7
>>
>> What providers, with what code/interfaces we would have to release in
>> this case and what dependencies such providers that want to use it (both
>> community and Airflow should have)? I **think** that would mean exactly the
>> "common.*" approach we already have with "io" and "sql", but
>> maybe I do not understand it :)
>>
>> On Thu, Feb 22, 2024 at 1:45 PM Tzu-ping Chung 
>> wrote:
>>
>>> I like the polyfill idea. A backport provider that brings new interfaces
>>> to providers without the actual functionalities.
>>>
>>>
>>> > On 22 Feb 2024, at 20:41, Maciej Obuchowski 
>>> wrote:
>>> >
>>> >> That's why I generally do
>>> > not like the "util" approach because common packaging introduces
>>> > unnecessary coupling (you have to upgrade independent utils together).
>>> >
>>> > From my experience with releasing OpenLineage where we do things
>>> similarly:
>>> > I think that's the advantage of it, but only _if_ you can release those
>>> > together.
>>> > With "built-in" providers it makes sense, but could be burdensome if
>>&

Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-04-08 Thread Jarek Potiuk
Hello here.

I applied a number of comments, mostly from TP, Ash, and Daniel, who were
concerned about the scope of the AIP (including the DB changes), but also
from Shubham, Nikolas and others.

The small (but most visible) change is the name, where "multi-tenant" might
promise too much. I also talked to at least a few users (including a big
bank I spoke to the previous week, who explained how they implemented
something similar in their deployments). The last conversation actually
assured me that if we make it easier for users who need to support
multiple teams within their organization, explain to them how they can
do it, and provide workload - and eventually DB - isolation, it might be a
really useful deployment option for such users.

I updated the document today, hoping that all the feedback is addressed.
The main changes that I implemented:

* The name is changed to "Multi-team deployment of Airflow" and the
"--tenant" flag is changed to "--team" flag across the board.  I also -
after some thoughts - think it's genuinely better name as it avoids a lot
of ambiguities connected with multi-tenancy. I think different people have
different understanding of what "multi-tenancy" is, so it's better to be as
precise as possible about what we are proposing to our users.
* Removed Connections/Variables from the scope altogether. I realized that
we can drop all of it, by simply disabling the access to metadata DB (and
UI access) for those, and only rely on Secrets Managers (separately
configured per each team). That will cut down the scope a lot, it also
nicely solves the problem where each team might have different providers
installed when it comes to "connections" UI - the Connections / Variables
screens will be gone in a multi-team environment (BTW. That's what I
learned from the big bank - they modified their airflow to remove metadata
db Connection/Variables access).
* I proposed to introduce the AIP in two phases - first without the DB
isolation, second - with DB isolation. I think even phase 1 will be useful
for the users, and it will be way simpler to implement and test. But there
is still value in the DB isolation, and I would like to keep it as a
single AIP to vote on, as only then can we have the right "Security"
perimeter in place.
* I proposed adding a very simple Dataset access control mechanism -
following today's implementation, anyone can produce any dataset events,
but triggering is limited by default to the same "team", while DAG authors
can specify which "teams" can trigger the DAG via dataset.
* I renamed "Internal API" to "GRPC API" - proposed by Daniel in one of the
recent PRs, I think it's a better name and I started to use it everywhere.

I believe that goes quite far in addressing all the concerns raised. I
would love to start a vote on it by the end of the week unless there are
serious concerns.

Feel free to comment here - or in the document.

J.




On Tue, Mar 12, 2024 at 12:05 AM Jarek Potiuk  wrote:

> I have iterated and already got a LOT of comments from a LOT of people
> (Thanks everyone who spent time on it ). I'd say the document is out of
> draft already, it very much describes the idea of multi-tenancy that I hope
> we will be voting on some time in the future.
>
> Taking into account that ~30% of people in our survey said they want
> "multi-tenancy" - what I am REALLY interested in is to get honest feedback
> about the proposal. Mainly:
>
> *"Is this the multi-tenancy you were looking for?"*
>
> Or were you looking for different droids (err, tenants), maybe?
>
> I do not want to exercise my Jedi skills to influence your opinion, that's
> why the document is there (and some people say it's nice, readable and
> pretty complete) so that you can judge yourself and give feedback.
>
> The document is here:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-tenant+deployment+of+Airflow+components
>
>
> Feel free to comment here, or in the document. I would love to hear more
> voices, and have some ideas what to do next to validate the idea, so please
> - engage for now - but also expect some follow-ups.
>
> J.
>
>
> On Wed, Mar 6, 2024 at 9:16 AM Jarek Potiuk  wrote:
>
>> Sooo.. Seems that it's an AIP time :D I've just published a Draft of
>> AIP-67:
>>
>> Multi-tenant deployment of Airflow components
>>
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-67+Multi-tenant+deployment+of+Airflow+components
>>
>> This AIP is a bit lighter in detail than the others you could see
>> from Jed, Nikolas and Maciej. This is really a DRAFT / High Level
>> idea of Multi-Tenancy that could be implemented as the follow-up after
>> previous steps of Multi-Tenancy

Re: [DISCUSS] Asynchronous SQLAlchemy

2024-04-08 Thread Jarek Potiuk
Yep. If we can make both Postgres and MySQL work with async - I am also all
for the "all" approach. If it means that we need to support only certain
drivers and certain versions of the DBs - so be it. As mentioned in my
original comments (a long time ago, when we still had MSSQL support) - this
was not really possible back then - but now, by getting rid of MSSQL, and if
we have the right drivers for MySQL, it should be possible - I guess.
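The concurrency argument from Hussein's PoC description quoted below - that many lightweight coroutines can replace thread-per-query workarounds - can be illustrated with plain stdlib asyncio, with `asyncio.sleep` standing in for independent DB queries (no SQLAlchemy involved; this is only an illustration of the pattern):

```python
import asyncio
import time


async def fake_query(delay: float) -> float:
    # Stand-in for an independent DB call (e.g. reading a connection or variable).
    await asyncio.sleep(delay)
    return delay


async def run_concurrently(n: int, delay: float) -> float:
    """Run n independent 'queries' concurrently and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_query(delay) for _ in range(n)))
    return time.perf_counter() - start


elapsed = asyncio.run(run_concurrently(10, 0.1))
print(f"10 concurrent 0.1s 'queries' took ~{elapsed:.2f}s (sequentially: ~1s)")
```

With the SQLAlchemy asyncio extension, the same `gather` pattern applies to real queries issued through an async session instead of `sleep`.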

On Mon, Apr 8, 2024 at 8:17 PM Daniel Standish
 wrote:

> I wholeheartedly agree with Ash that it should be all or nothing.  And
> *all* sounds
> better to me :)
>
>
>
> On Mon, Apr 8, 2024 at 10:54 AM Ash Berlin-Taylor  wrote:
>
> > I’m all in favour of async SQLAlchemy. We’ve built two products
> > exclusively at Astronomer that use sqlalchemy+psycopg3+async and love it.
> > Async does take a bit of a learning curve, but SQLA has done it nicely
> > and it works really well.
> >
> > I think this needs to be an all or nothing thing — having to maintain
> > sync and async versions of functions/features is a non-starter in my
> > mind; it’d just be a worryingly large amount of duplicated work. Given
> > that the only DBs we support now are Postgres and MySQL, I can’t think of
> > any reason users should even care — they give it a DSN and that’s the end
> > of their involvement.
> >
> > Amogh: I don’t understand what you mean by point 3 below.
> >
> > -ash
> >
> > > On 8 Apr 2024, at 05:31, Amogh Desai  wrote:
> > >
> > > I checked the content and the PR that you attached.
> > >
> > > The results do seem promising and I like the general idea of this
> > approach.
> > > But as Jarek
> > > also mentioned on the PR:
> > >
> > > 1. Not everyone might be on board to go all async due to certain
> > > limitations around access to the drivers, or corporate limitations.
> > > So, we definitely need a way to opt out for the ones who aren't
> > > interested.
> > >
> > > 2. We should have a seamless fallback to sync if async doesn't work for
> > > whatever reasons.
> > >
> > > 3. Are we going all in, or are we limiting the scope to, let's say,
> > > connections + variables, and expanding based on the results in the
> > > long term?
> > >
> > > Looking forward to improvements async can bring in!
> > >
> > > Thanks & Regards,
> > > Amogh Desai
> > >
> > >
> > > On Sun, Apr 7, 2024 at 3:13 AM Hussein Awala  wrote:
> > >
> > >> The Metadata Database is the brain of Airflow, where all scheduling
> > >> decisions, cross-communication, synchronization between components,
> and
> > >> management via the web server, are made using this database.
> > >>
> > >> One option to optimize the DB queries is to merge many into a single
> > query
> > >> to reduce latency and overall time, but this is not always possible
> > because
> > >> the queries are sometimes completely independent, and it is
> > impossible/too
> > >> complicated to merge them. But in this case, we have another option
> > which
> > >> is running them concurrently since they are independent. The only way
> > to do
> > >> this currently is to use multithreading (the sync_to_async decorator
> > >> creates a thread and waits for it using an asyncio coroutine), which
> is
> > >> already a good start, but by using the asyncio extension for
> sqlalchemy
> > we
> > >> will be able to create thousands of lightweight coroutines with the
> same
> > >> amount of resources as a few threads, which will also help to reduce
> > >> resources consumption.
> > >>
> > >> A few months ago I started a PoC to add support for this extension and
> > >> implement an asynchronous version of connections and variables to be
> > able
> > >> to get/set them from triggers without blocking the event loop and
> > affecting
> > >> the performance of the triggerer, and the result was impressive (
> > >> https://github.com/apache/airflow/pull/36504).
> > >>
> > >> I see a good opportunity to improve the performance of our REST API
> and
> > web
> > >> server (for example https://github.com/apache/airflow/issues/38776),
> > >> knowing that we can mix sync and async endpoints, which will help for
> a
> > >> smooth migration.
> > >>
> > >> I also think that it will be possible (and very useful) to migrate
> some
> > of
> > >> our executors to a full asynchronous version to improve their
> > performance
> > >> (kubernetes and celery)
> > >>
> > >> I use the sqlalchemy asyncio extension in many personal and company
> > >> projects, and I'm very happy with it, but I would like to hear from
> > others
> > >> if they have any positive or negative feedback about it.
> > >>
> > >> I will create a new AIP for integrating the asyncio extension of
> > >> sqlaclhemy, and other following AIPs to migrate/support each component
> > once
> > >> the first one is implemented, but first, I prefer to check what the
> > >> community and other committers think about this integration.
> > >>
> >
> >

Re: [ANNOUNCE] New committer: Wei Lee

2024-04-08 Thread Jarek Potiuk
Congrats Wei! Indeed, well deserved!

On Mon, Apr 8, 2024 at 10:54 AM Hussein Awala  wrote:

> Congrats Wei, well deserved!
>
> On Monday, April 8, 2024, Ash Berlin-Taylor  wrote:
>
> > The Project Management Committee (PMC) for Apache Airflow has invited Wei
> > Lee to become a committer and
> > we are pleased to announce that they have accepted.
> >
> > Wei has been contributing for a number of months; he has also
> > participated a lot in discussions and decisions on many aspects of
> > Airflow, and helps a lot of our users and contributors on Slack, GitHub,
> > and Discussions.
> >
> > I am looking forward to what the future holds with Wei becoming a
> > committer.
> >
> > Congratulations Wei, and welcome onboard!
> >
> > Being a committer enables easier contribution to the project since there
> > is no need to go via the patch
> > submission process. This should enable better productivity. A PMC member
> > helps manage and guide
> > the direction of the project.
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > For additional commands, e-mail: dev-h...@airflow.apache.org
> >
> >
>


Re: [VOTE] Release Airflow 2.9.0 from 2.9.0rc3

2024-04-07 Thread Jarek Potiuk
+1 (binding): checked reproducibility, licences, signatures, checksums, ran
a few DAGs, installed with the celery executor, and verified that the
configuration is properly displayed in the UI (comparing with 2.9.0rc2)

On Sun, Apr 7, 2024 at 8:16 AM Ephraim Anierobi 
wrote:

> Hey fellow Airflowers,
>
> I have cut Airflow 2.9.0rc3. This email is calling a vote on the release,
> which will last at least 26 hours, from Sunday, April 7, 2024 at 06:15 am
> UTC
> until Monday, April 8, 2024, at 8:15 am UTC
> <
> https://www.timeanddate.com/worldclock/fixedtime.html?msg=8=20240408T0815=1440
> >,
> and until 3 binding +1 votes have been received.
>
> Consider this my (binding) +1.
>
> Airflow 2.9.0rc3 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/2.9.0rc3/
>
> *apache-airflow-2.9.0-source.tar.gz* is a source release that comes with
> INSTALL instructions.
> *apache-airflow-2.9.0.tar.gz* is the binary Python "sdist" release.
> *apache_airflow-2.9.0-py3-none-any.whl* is the binary Python wheel "binary"
> release.
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The test procedure for PMC members is described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-pmc-members
>
> The test procedure for and Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-contributors
>
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 2.9.0. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> Release Notes:
> https://github.com/apache/airflow/blob/2.9.0rc3/RELEASE_NOTES.rst
>
> For information on what goes into a release please see:
>
> https://github.com/apache/airflow/blob/main/dev/WHAT_GOES_INTO_THE_NEXT_RELEASE.md
>
> Changes since RC2:
>
> *Bugs*:
> - Load providers configuration when gunicorn workers start (#38795)
>
> Cheers,
> Ephraim
>


Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

2024-04-05 Thread Jarek Potiuk
Seeing no big "no's" - I will prepare and run the experiment - starting
some time next week, after we get 2.9.0 out - I do not want to break
anything there. In the meantime, preparatory PR to add "use self-hosted
runners" label is out https://github.com/apache/airflow/pull/38779

On Fri, Apr 5, 2024 at 4:21 PM Bishundeo, Rajeshwar
 wrote:

> +1 with trying this out. I agree with keeping the canary builds
> self-hosted in order to validate the usage for the PRs.
>
> -- Rajesh
>
>
> From: Jarek Potiuk 
> Reply-To: "dev@airflow.apache.org" 
> Date: Friday, April 5, 2024 at 8:36 AM
> To: "dev@airflow.apache.org" 
> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> self-hosted runners for commiter PRs
>
>
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
>
>
> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe.
> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez
> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que
> le contenu ne présente aucun risque.
>
>
> Yeah. Valid concerns Hussein.
>
> And I am happy to share some more information on that. I did not want to
> put all of that in the original email, but I see that might be interesting
> for you and possibly others.
>
> I am closely following the numbers now. One of the reasons I am doing /
> proposing it now is that finally (after almost 3 years of waiting) we
> have access to some metrics that we can check. As of last week I got
> access to the ASF metrics (
> https://issues.apache.org/jira/browse/INFRA-25662).
>
> I have access to "organisation" level information. Infra does not want to
> open it to everyone - even to every member - but since I got very active
> and have been helping with a number of issues, I got the access granted as
> an exception. I also saw a small dashboard that INFRA is preparing to open
> to everyone once they sort out the access, where we will be able to see
> the "per-project" usage.
> Some stats that I can share (they asked not to share too much).
>
> From what I looked at, I can tell that we are right now (the whole ASF
> organisation) safely below the total capacity - with a large margin,
> enough to handle spikes. But of course the growth of usage is there, and
> if uncontrolled, we can again reach the same situation that triggered
> getting self-hosted runners a few years ago.
> Luckily - INFRA is getting it under control this time (and metrics will
> help). In the last INFRA newsletter, they announced some limitations that
> will apply to the projects (effective as of the end of April) - so once
> those are followed, we should be "safe" from being impacted by others
> (i.e. the noisy-neighbour effect). Some of the projects (not Airflow (!))
> were exceeding those so far and they will be capped - they will need to
> optimize their builds eventually.
>
> Those are the rules:
>
> * All workflows MUST have a job concurrency level less than or equal to
> 20. This means a workflow cannot have more than 20 jobs running at the same
> time across all matrices.
> * All workflows SHOULD have a job concurrency level less than or equal to
> 15. Just because 20 is the max, doesn't mean you should strive for 20.
> * The average number of minutes a project uses per calendar week MUST NOT
> exceed the equivalent of 25 full-time runners (250,000 minutes, or 4,200
> hours).
> * The average number of minutes a project uses in any consecutive five-day
> period MUST NOT exceed the equivalent of 30 full-time runners (216,000
> minutes, or 3,600 hours).
> * Projects whose builds consistently cross the maximum use limits will
> lose their access to GitHub Actions until they fix their build
> configurations.
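As a sanity check of what those caps mean: N "full-time runners" over D days corresponds to N * D * 24 * 60 runner-minutes. Note the newsletter's weekly figure (250,000 minutes) is a rounded cap; the exact weekly equivalent of 25 runners is 252,000 minutes, while the five-day cap matches exactly:

```python
def full_time_runner_minutes(runners: int, days: int) -> int:
    # N "full-time runners" over D days equals N * D * 24h * 60min runner-minutes.
    return runners * days * 24 * 60


weekly_cap = full_time_runner_minutes(25, 7)    # exact: 252,000 (quoted cap: 250,000)
five_day_cap = full_time_runner_minutes(30, 5)  # 216,000 minutes == 3,600 hours
print(weekly_cap, five_day_cap, five_day_cap // 60)
```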
>
> Those numbers on their own do not tell much, but we can easily see what
> they mean when we put them side-by-side with "our" current numbers.
>
> * Currently - with all the "public" usage - we are at 8 full-time runners.
> This is after some of the changes I've done; with the recent changes I
> already moved a lot of the non-essential build components that do not
> require a lot of parallelism to public runners.
> * The 20/15 jobs limit is a bit artificial (not really enforceable at the
> workflow level) - but in our case, as I optimized most PRs to run just a
> subset of the tests, the average will be way below that - no matter if you
> are a committer or not, regular PRs run a far smaller subset of the jobs
> than the full "canary" build. And for canary builds we should stay - at
> least for
> n

Re: [DISCUSS] Consider disabling self-hosted runners for commiter PRs

2024-04-05 Thread Jarek Potiuk
> Although 900 runners seem like a lot, they are shared among the Apache
> organization's 2.2k repositories, of course only a few of them are active
> (let's say 50), and some of them use an external CI tool for big jobs (eg:
> Kafka uses Jenkins, Hudi uses Azure pipelines), but we have other very
> active repositories based entirely on GHA, for example, Iceberg, Spark,
> Superset, ...
>
> I haven't found the ASF runners metrics dashboard to check the max
> concurrency and the max queued time during peak hours, but I'm sure that
> moving Airflow committers' CI jobs to public runners will put some pressure
> on these runners, especially since these committers are the most active
> contributors to Airflow, and the 35 self-hosted runners (with 8 CPUs and 64
> GB RAM) are used almost all the time, so we can say that we will need
> around 70 ASF runners to run the same jobs.
>
> There is no harm in testing and deciding after 2-3 weeks.
>
> We also need to find a way to let the infra team help us solve the
> connectivity problem with the ARC runners
> <
> https://issues.apache.org/jira/projects/INFRA/issues/INFRA-25117?filter=reportedbyme
> >
> .
>
> +1 for testing what you propose.
>
> On Fri, Apr 5, 2024 at 12:07 PM Amogh Desai 
> wrote:
>
> > +1 I like the idea.
> > Looking forward to seeing the difference.
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> >
> > On Fri, Apr 5, 2024 at 3:54 AM Ferruzzi, Dennis
> > 
> > wrote:
> >
> > > Interested in seeing the difference, +1
> > >
> > >
> > >  - ferruzzi
> > >
> > >
> > > 
> > > From: Oliveira, Niko 
> > > Sent: Thursday, April 4, 2024 2:00 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > > self-hosted runners for commiter PRs
> > >
> > >
> > > +1 I'd love to see this as well.
> > >
> > > In the past, stability and long queue times of PR builds have been very
> > > frustrating. I'm not 100% sure this is due to using self-hosted runners,
> > > since a 35 queue depth (to my mind) should be plenty. But something
> > > about that setup has never seemed quite right to me with queuing.
> > > Switching to public runners for a while to experiment would be great to
> > > see if it improves.
> > >
> > > 
> > > From: Pankaj Koti 
> > > Sent: Thursday, April 4, 2024 12:41:02 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Consider disabling
> > > self-hosted runners for commiter PRs
> > >
> > >
> > >
> > > +1 from me to this idea.
> > >
> > > Sounds very reasonable to me.
> > > At times, my experience has been better with public runners instead of
> > > self-hosted runners :)
> > >
> > > And like already mentioned in the discussion, I think having the
> > > ability to apply the "use-self-hosted-runners" label at critical
> > > times would be nice to have too.
> > >
> > >
> > > On Fri, 5 Apr 2024, 00:50 Jarek Potiuk,  wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > TL;DR With some recent changes in GitHub A

Re: [VOTE] Release Airflow 2.9.0 from 2.9.0rc2

2024-04-04 Thread Jarek Potiuk
+1 (binding) - checked reproducibility, signatures, checksums, licences -
all good. Installed it, ran a few DAGs, clicked through a number of
screens. Also verified the final package, and it looks good with the right
FAB >=1.0.2 dependency.

On Thu, Apr 4, 2024 at 11:25 PM Ephraim Anierobi 
wrote:

> Hey fellow Airflowers,
>
> I have cut Airflow 2.9.0rc2. This email is calling a vote on the release,
> which will last at least 52 hours, from Thursday, April 4, 2024, at 9:00 pm
> UTC
> until Sunday, April 7, 2024, at 1:00 am UTC
> <
> https://www.timeanddate.com/worldclock/fixedtime.html?msg=8=20240407T0100=1440
> >,
> and until 3 binding +1 votes have been received.
>
> Consider this my (binding) +1.
>
> Airflow 2.9.0rc2 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/2.9.0rc2/
>
> *apache-airflow-2.9.0-source.tar.gz* is a source release that comes with
> INSTALL instructions.
> *apache-airflow-2.9.0.tar.gz* is the binary Python "sdist" release.
> *apache_airflow-2.9.0-py3-none-any.whl* is the binary Python wheel "binary"
> release.
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The test procedure for PMC members is described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-pmc-members
>
> The test procedure for and Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-contributors
>
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 2.9.0. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> Release Notes:
> https://github.com/apache/airflow/blob/2.9.0rc2/RELEASE_NOTES.rst
>
> For information on what goes into a release please see:
>
> https://github.com/apache/airflow/blob/main/dev/WHAT_GOES_INTO_THE_NEXT_RELEASE.md
>
> Changes since 2.9.0rc1:
>
> *Bug Fixes*:
> - Fix decryption of trigger kwargs when downgrading (#38743)
> - Fix grid header rendering (#38720)
>
> *Doc-only Change*:
> - Improve timetable documentation (#38505)
> - Reorder OpenAPI Spec tags alphabetically (#38717)
>
> Cheers,
> Ephraim
>


[DISCUSS] Consider disabling self-hosted runners for commiter PRs

2024-04-04 Thread Jarek Potiuk
Hello everyone,

TL;DR With some recent changes in GitHub Actions and the fact that ASF has
a lot of runners available donated for all the builds, I think we could
experiment with disabling "self-hosted" runners for committer builds.

The self-hosted runners of ours have been extremely helpful (and we should
again thank Amazon and Astronomer for donating credits / money for those) -
back when the GitHub public runners were far less powerful and we had
fewer of those available for ASF projects. This saved us a LOT of trouble
when there was contention between ASF projects.

But as of recently both limitations have been largely removed:

* ASF has 900 public runners donated by GitHub to all projects
* Those public runners (as of January) now have 4 CPUs and 16GB of memory
for open-source projects -
https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/


While they are not as powerful as our self-hosted runners, the parallelism
we utilise brings those builds into not-that-bad shape compared to
self-hosted runners. Typical differences between the public and self-hosted
runners now, for the complete set of tests, are ~20m for public runners and
~14m for self-hosted ones.

But this is not the only factor - I think committers experience "Job
failed" errors on self-hosted runners generally much more often than
non-committers (the stability of our solution is not the best; also we are
using cheaper spot instances). Plus - we limit the total number of
self-hosted runners (35) - so if several committers submit a few PRs and we
have a canary build running, the jobs will wait until runners are available.

And of course it costs the credits/money of sponsors which we could use for
other things.

I have - as of recently - access to GitHub Actions metrics - and while ASF
is keeping an eye on things and has started limiting the number of parallel
jobs that project workflows run, it looks like even if all committer runs
are added to the public runners, we will still cause far lower usage than
the limits - and far lower than some other projects (which I will not name
here). I have access to the metrics, so I can monitor our usage and react.

I think possibly - if we switch committers to "public" runners by default -
the experience will not be much worse for them (and sometimes even better -
because of stability / a limited queue).

I was planning this carefully - I made a number of refactors/changes to our
workflows recently that make it way easier to manipulate the configuration
and apply various conditions to various jobs - so changing/experimenting
with those settings should be - well - a breeze :). A few recent changes
have proven that this change and workflow refactor were definitely worth
the effort; I feel like I finally got control over it, where previously it
was a bit like herding a pack of cats (which I brought to life myself, but
that's another story).

I would like to propose to run an experiment and see how it works if we
switch committer PRs back to the public runners - leaving the self-hosted
runners only for canary builds (which makes perfect sense because those
builds run a full set of tests and we need as much speed and power there as
we can get).

This is pretty safe; we should be able to switch back very easily if we see
problems. I will also monitor it and see if our usage is within the ASF
limits. I can also add a feature allowing committers to use the self-hosted
runners by applying a "use self-hosted runners" label to a PR.
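The kind of label-driven routing described above can be sketched as a small selection function. This is a minimal illustration, not Airflow's actual CI configuration - the runner labels, the helper name, and the exact label string are illustrative assumptions:

```python
# Sketch of label-driven runner selection for a CI workflow.
# Runner label sets below are illustrative, not Airflow's real ones.
SELF_HOSTED = '["self-hosted", "Linux", "X64"]'
PUBLIC = '"ubuntu-22.04"'


def choose_runs_on(is_canary: bool, pr_labels: list[str]) -> str:
    """Return the runs-on value a CI job should use."""
    if is_canary:
        # Canary builds run the full test matrix - keep them on the
        # faster self-hosted runners.
        return SELF_HOSTED
    if "use self-hosted runners" in pr_labels:
        # Committers can opt back in per-PR via a label.
        return SELF_HOSTED
    # Default for PRs (committers and non-committers alike): public runners.
    return PUBLIC


print(choose_runs_on(True, []))
print(choose_runs_on(False, ["use self-hosted runners"]))
print(choose_runs_on(False, []))
```

In a real workflow the result of such a function would feed the `runs-on` field of the jobs, computed once in an early "build info" job.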

Running it for 2-3 weeks should be enough to gather experience from
committers - whether things will seem better or worse for them - or maybe
they won't really notice a big difference.

Later we could consider some next steps - disabling the self-hosted runners
for canary builds if we see that our usage is low and build are fast
enough, eventually possibly removing current self-hosted runners and
switching to a better k8s based infrastructure (which we are close to do
but it makes it a bit difficult while current self-hosted solution is so
critical to keep it running (like rebuilding the plane while it is flying).
I'd love to do it gradually in the "change slowly and observe" mode -
especially now that I have access to "proper" metrics.

WDYT?

J.


[ANNOUNCE] Apache Airflow Providers prepared on March 25, 2024 are released

2024-04-03 Thread Jarek Potiuk
Dear Airflow community,

I'm happy to announce that new versions of Airflow Providers packages
prepared on March 25, 2024
were just released. The full list of released PyPI packages is added at the
end of this message.

The source release, as well as the binary releases, are available here:

https://airflow.apache.org/docs/apache-airflow-providers/installing-from-sources

You can install the providers via PyPI:
https://airflow.apache.org/docs/apache-airflow-providers/installing-from-pypi

The documentation is available at https://airflow.apache.org/docs/ and
linked from the PyPI packages.



Full list of released PyPI packages:

https://pypi.org/project/apache-airflow-providers-fab/1.0.2/

Cheers,

J


[RESULT][VOTE] Airflow Providers - release of March 25, 2024

2024-04-03 Thread Jarek Potiuk
Hello,

[Filling-in for Elad - as we have a few follow-up tasks after we release
FAB provider for Airflow 2.9.]

Apache Airflow Providers (fab provider) prepared on March 25, 2024 have
been accepted.

4 "+1" binding votes received:
- Elad Kalif (binding)
- Jarek Potiuk (binding)
- Hussein Awala (binding)
- Kaxil Naik (binding)

1 "+1" non-binding vote received:
- Amogh Desai

Vote thread:
https://lists.apache.org/thread/9tjmb3pqkn1tnm6wnql6pg37jrk5m5sn

I'll continue with the release process, and the release announcement will
follow shortly.

Cheers,
J.


Re: Apache Airflow 2.9.0b2 available for testing

2024-04-02 Thread Jarek Potiuk
Proposed https://github.com/apache/airflow/pull/38675 - and likely we
should add a warning in the Airflow 2.9 release notes that Python 3.12 (which
will be the default in `apache/airflow`) cannot use Pendulum 2, and that
users are recommended to convert their code to be Pendulum 3 compatible, or
to use a different Python version and downgrade to Pendulum 2.
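Code that must support both Pendulum major versions during such a transition can detect the installed version once and branch on it. A minimal sketch - the helper names here are hypothetical, not an Airflow or Pendulum API:

```python
# Detect the installed major version of a package so code can branch
# between Pendulum 2 and Pendulum 3 behaviour. Helper names are
# illustrative assumptions.
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_from(version_string: str) -> int:
    """Extract the major version number from a version string."""
    return int(version_string.split(".")[0])


def installed_major(package: str) -> Optional[int]:
    """Return the major version of an installed package, or None if absent."""
    try:
        return major_from(version(package))
    except PackageNotFoundError:
        return None


PENDULUM_MAJOR = installed_major("pendulum")
if PENDULUM_MAJOR == 2:
    print("Pendulum 2 installed")
elif PENDULUM_MAJOR == 3:
    print("Pendulum 3 installed (required for Python 3.12)")
else:
    print("Pendulum is not installed")
```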

On Fri, Mar 29, 2024 at 10:42 AM Jarek Potiuk  wrote:

> bound to Pendulum 3 of course.
>
> On Fri, Mar 29, 2024 at 10:42 AM Jarek Potiuk  wrote:
>
>> Yes. IMHO we should add retroactively warning to 2.8.1 release notes with
>> comment that you can still downgrade to Pendulum 2 to get old behaviour and
>> then in 2.9 release notes mention that 3.12 users are bound to Pendulum 2.
>>
>> J.
>>
>>
>> On Fri, Mar 29, 2024 at 10:17 AM Bolke de Bruin 
>> wrote:
>>
>>> I think it's just nice to give a heads up to users. If you want to use
>>> python 3.12 you are tied to pendulum 3 and airflow 2.9+. It's our first
>>> release that supports 3.12 so I think it makes sense to add the note.
>>>
>>> B.
>>>
>>> Sent from my iPhone
>>>
>>> > On 28 Mar 2024, at 17:07, Jarek Potiuk  wrote:
>>> >
>>> > 
>>> >>
>>> >> It is as we require pendulum 3 for python 3.12 support.
>>> >
>>> > Not really. Pendulum 3+ is the only version that works with Python
>>> 3.12,
>>> > yes, But upgrading to Pendulum 3 was not connected to Airflow 2.9.
>>> >
>>> > Look it up in our repo. Pendulum <4 has been added in Airflow 2.8.1 and
>>> > since then we have Pendulum 3 in constraints of Airflow - which also
>>> means
>>> > that all Airflow versions for 2.8.1 already use Pendulum 3 when
>>> installed
>>> > via image, unless someone downgrades Pendulum to 2 manually (which is
>>> > possible for any Python version < 3.12.
>>> >
>>> > While possibly we should warn people that Pendulum 3 serialization
>>> changes
>>> > for ISO8601, it's not really connected with the Airflow 2.9 release.
>>> >
>>> > J.
>>> >
>>> >> On Thu, Mar 28, 2024 at 4:47 PM Bolke de Bruin 
>>> wrote:
>>> >>
>>> >> It is as we require pendulum 3 for python 3.12 support.
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >>> On 28 Mar 2024, at 15:21, Nathan Hadfield 
>>> >> wrote:
>>> >>>
>>> >>> Yes, there’s this issue (turned discussion) about it but it is not
>>> >> specific to 2.9.
>>> >>>
>>> >>> https://github.com/apache/airflow/discussions/37037
>>> >>>
>>> >>> From: Bolke de Bruin 
>>> >>> Date: Thursday, 28 March 2024 at 13:01
>>> >>> To: dev@airflow.apache.org 
>>> >>> Subject: Re: Apache Airflow 2.9.0b2 available for testing
>>> >>> Pendulum 3 serializes ISO8601 slightly different (missing T) AFAIK. I
>>> >> thought someone opened an issue for that (don't have it handy). Maybe
>>> it is
>>> >> something we should at least mention in the release notes?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Sent from my iPhone
>>> >>>
>>> >>>
>>> >>>
>>> >>>>> On 28 Mar 2024, at 00:54, Ephraim Anierobi <
>>> ephraimanier...@apache.org>
>>> >> wrote:
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>> Hey fellow Airflowers,
>>> >>>
>>> >>>>
>>> >>>
>>> >>>> We have cut Airflow 2.9.0b2 now that all the main features have been
>>> >>>
>>> >>>> included.
>>> >>

Re: [DISCUSS] Proposal for adding Telemetry via Scarf

2024-03-30 Thread Jarek Potiuk
Hello everyone,

it has to be:

1. Opt-in by default to not trigger security guys about new unplanned
> activity after regular upgrade.
>

That's a very good point about security triggering, Alexander, but I am not
so sure it means that we "have to" do opt-in. There are other ways of
communicating with the "deployment managers" who install and upgrade
Airflow - i.e. release notes, blogs, our social media, Slack
announcements, etc. We have plenty of channels we can use to communicate the
change.

I think we have a very good blueprint to follow, including at least 5 other
ASF projects that also passed the review of privacy@asf. And while I
understand (and concur with) the urge for opt-in by default coming from the
consumer market (where it makes perfect sense), Airflow is not consumer
software; it is used in "corporate environments", which have slightly
different expectations and a broad assumption that the company can make
decisions on such telemetry on behalf of the employees using it.

We should assume that those who deploy and upgrade Airflow actually read
and take into account what is written in the release notes - especially if
they have security guys breathing down their necks - similarly as we have to
assume they follow CVE announcements about security issues fixed. If we
are very straightforward and outgoing about the change, and inform very
clearly how to opt out, I don't see a big problem with opt-out.
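For illustration, an opt-out check on the deployment side could look like the sketch below. Scarf's scarf-js package documents the `SCARF_ANALYTICS=false` and `DO_NOT_TRACK` opt-outs; anything beyond those two variable names here is an assumption, not a decided Airflow mechanism:

```python
import os


def telemetry_enabled(env: dict) -> bool:
    """Respect common telemetry opt-out conventions before sending anything.

    DO_NOT_TRACK is a cross-tool convention; SCARF_ANALYTICS=false is the
    opt-out documented by scarf-js. This helper itself is hypothetical.
    """
    if env.get("DO_NOT_TRACK", "").lower() in ("1", "true"):
        return False
    if env.get("SCARF_ANALYTICS", "").lower() == "false":
        return False
    return True


print(telemetry_enabled(dict(os.environ)))  # check the current environment
```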

We should of course check with privacy@a.o (but I'v spend a good deal of
time reading the Superset  and other use case and explanation in detail to
make a better informed decision) - and it looks like they also went opt-out
way and got cleared by privacy@a.o.  And if we cannot reach consensus, we
should - as usual - make a voting decision on it (because yes, it is an
important decision), but - after reading and understanding why others also
did it - for me personally, opt-out is a good path.

Also because it will rather increase the amount of data gathered, and in
our case - counter-intuitively - it will be even better for privacy and
corporate anonymity, because the more data we get, the more difficult it
will be to get any non-statistical/non-aggregated insight from it. Imagine
if only a few corporate users enabled it consciously - then we would be
able to draw many more conclusions if we found out who they are, than if
everyone has it enabled by default.

That's my take on it - but again, it's up to us to vote; for me, opt-in is
not a "have to", and I am rather for opt-out.

J.

> Hi all,
>
>
> > I want to propose gathering telemetry for Airflow installations. As the
> > Airflow community, we have been relying heavily on the yearly Airflow
> > Survey and anecdotes to answer a few key questions about Airflow usage.
> > Questions like the following:
> >
> >
> >- Which versions of Airflow are people installing/using now (i.e.
> >whether people have primarily made the jump from version X to version
> Y)
> >- Which DB is used as the Metadata DB and which version e.g Pg 14?
> >- What Python version is being used?
> >- Which Executor is being used?
> >- Approximately how many people out there in the world are installing
> >Airflow
> >
> >
> > There is a solution that should help answer these questions: Scarf [1].
> The
> > ASF already approves Scarf [2][3] and is already used by other ASF
> > projects: Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes, DevLake,
> > Skywalking as it follows GDPR and other regulations.
> >
> > Similar to Superset, we probably can use it as follows:
> >
> >
> >1. Install the `scarf js` npm package and bundle it in the Webserver.
> >When the package is downloaded & Airflow webserver is opened, metadata
> > is
> >recorded to the Scarf dashboard.
> >2. Utilize the Scarf Gateway [6], which we can use in front of docker
> >containers. While it’s possible people go around this gateway, we can
> >probably configure and encourage most traffic to go through these
> > gateways.
> >
> > While Scarf does not store any personally identifying information from
> SDK
> > telemetry data, it does send various bits of IP-derived information as
> > outlined here [7]. This data should be made as transparent as possible by
> > granting dashboard access to the Airflow PMC and any other relevant means
> > of sharing/surfacing it that we encounter (Town Hall, Slack, Newsletter
> > etc).
> >
> > The following case studies are worth reading:
> >
> >1. https://about.scarf.sh/post/scarf-case-study-apache-superset (From
> >Maxime)
> >2.
> >
> >
> https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
> >
> > Similar to them, this could help in various ways that come with using
> data
> > for decision-making. With clear guidelines on "how to opt-out"
> [8][9][10] &
> > "what data is being collected" on the Airflow website, this can be
> > beneficial to the entire community as we would be 

Re: [DISCUSS] Proposal for adding Telemetry via Scarf

2024-03-29 Thread Jarek Potiuk
+1. All that sounds reasonable, there are precedents, ASF supports Scarf
officially. Would be great to have access to such telemetry data.

On Sat, Mar 30, 2024 at 1:18 AM Kaxil Naik  wrote:

> Hi all,
>
> I want to propose gathering telemetry for Airflow installations. As the
> Airflow community, we have been relying heavily on the yearly Airflow
> Survey and anecdotes to answer a few key questions about Airflow usage.
> Questions like the following:
>
>
>- Which versions of Airflow are people installing/using now (i.e.
>whether people have primarily made the jump from version X to version Y)
>- Which DB is used as the Metadata DB and which version e.g Pg 14?
>- What Python version is being used?
>- Which Executor is being used?
>- Approximately how many people out there in the world are installing
>Airflow
>
>
> There is a solution that should help answer these questions: Scarf [1]. The
> ASF already approves Scarf [2][3] and is already used by other ASF
> projects: Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes, DevLake,
> Skywalking as it follows GDPR and other regulations.
>
> Similar to Superset, we probably can use it as follows:
>
>
>1. Install the `scarf js` npm package and bundle it in the Webserver.
>When the package is downloaded & Airflow webserver is opened, metadata
> is
>recorded to the Scarf dashboard.
>2. Utilize the Scarf Gateway [6], which we can use in front of docker
>containers. While it’s possible people go around this gateway, we can
>probably configure and encourage most traffic to go through these
> gateways.
>
> While Scarf does not store any personally identifying information from SDK
> telemetry data, it does send various bits of IP-derived information as
> outlined here [7]. This data should be made as transparent as possible by
> granting dashboard access to the Airflow PMC and any other relevant means
> of sharing/surfacing it that we encounter (Town Hall, Slack, Newsletter
> etc).
>
> The following case studies are worth reading:
>
>1. https://about.scarf.sh/post/scarf-case-study-apache-superset (From
>Maxime)
>2.
>
> https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
>
> Similar to them, this could help in various ways that come with using data
> for decision-making. With clear guidelines on "how to opt-out" [8][9][10] &
> "what data is being collected" on the Airflow website, this can be
> beneficial to the entire community as we would be making more informed
> decisions.
>
> Regards,
> Kaxil
>
>
> [1] https://about.scarf.sh/
> [2] https://privacy.apache.org/policies/privacy-policy-public.html
> [3] https://privacy.apache.org/faq/committers.html
> [4] https://github.com/apache/superset/issues/25639
> [5]
>
> https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
> [6] https://about.scarf.sh/scarf-gateway
> [7] https://about.scarf.sh/privacy-policy
> [8]
>
> https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
> [9]
>
> https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
> [10]
>
> https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
>


Re: [VOTE] AIP-62 Getting Lineage from Hook Instrumentation

2024-03-29 Thread Jarek Potiuk
+1 (binding). I think that's a really important missing piece in lineage
integration.

On Thu, Mar 28, 2024 at 12:47 PM Maciej Obuchowski 
wrote:

> Hello,
> I would like to start a vote on AIP-62: Getting Lineage from Hook
> Instrumentation, to be implemented into (presumably) Airflow 2.10.
>
> The AIP can be found here:
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-62+Getting+Lineage+from+Hook+Instrumentation
>
> Discussion Thread:
> https://lists.apache.org/thread/5chxcp0zjcx66d3vs4qlrm8kl6l4s3m2
>
> The vote will last until 2024-04-03 18:00 UTC and until at least 3 binding
> votes have been cast.
>
> Please vote accordingly:
>
> [ ] + 1 approve
> [ ] + 0 no opinion
> [ ] - 1 disapprove with the reason
>
> Only votes from PMC members and committers are binding, but other members
> of the community are encouraged to check the AIP and vote with
> "(non-binding)".
>
> Consider this my binding +1 vote.
>
> Thanks,
> Maciej
>


Re: Apache Airflow 2.9.0b2 available for testing

2024-03-29 Thread Jarek Potiuk
bound to Pendulum 3 of course.

On Fri, Mar 29, 2024 at 10:42 AM Jarek Potiuk  wrote:

> Yes. IMHO we should add retroactively warning to 2.8.1 release notes with
> comment that you can still downgrade to Pendulum 2 to get old behaviour and
> then in 2.9 release notes mention that 3.12 users are bound to Pendulum 2.
>
> J.
>
>
> On Fri, Mar 29, 2024 at 10:17 AM Bolke de Bruin  wrote:
>
>> I think it's just nice to give a heads up to users. If you want to use
>> python 3.12 you are tied to pendulum 3 and airflow 2.9+. It's our first
>> release that supports 3.12 so I think it makes sense to add the note.
>>
>> B.
>>
>> Sent from my iPhone
>>
>> > On 28 Mar 2024, at 17:07, Jarek Potiuk  wrote:
>> >
>> > 
>> >>
>> >> It is as we require pendulum 3 for python 3.12 support.
>> >
>> > Not really. Pendulum 3+ is the only version that works with Python 3.12,
>> > yes, But upgrading to Pendulum 3 was not connected to Airflow 2.9.
>> >
>> > Look it up in our repo. Pendulum <4 has been added in Airflow 2.8.1 and
>> > since then we have Pendulum 3 in constraints of Airflow - which also
>> means
>> > that all Airflow versions for 2.8.1 already use Pendulum 3 when
>> installed
>> > via image, unless someone downgrades Pendulum to 2 manually (which is
>> > possible for any Python version < 3.12.
>> >
>> > While possibly we should warn people that Pendulum 3 serialization
>> changes
>> > for ISO8601, it's not really connected with the Airflow 2.9 release.
>> >
>> > J.
>> >
>> >> On Thu, Mar 28, 2024 at 4:47 PM Bolke de Bruin 
>> wrote:
>> >>
>> >> It is as we require pendulum 3 for python 3.12 support.
>> >>
>> >> Sent from my iPhone
>> >>
>> >>> On 28 Mar 2024, at 15:21, Nathan Hadfield 
>> >> wrote:
>> >>>
>> >>> Yes, there’s this issue (turned discussion) about it but it is not
>> >> specific to 2.9.
>> >>>
>> >>> https://github.com/apache/airflow/discussions/37037
>> >>>
>> >>> From: Bolke de Bruin 
>> >>> Date: Thursday, 28 March 2024 at 13:01
>> >>> To: dev@airflow.apache.org 
>> >>> Subject: Re: Apache Airflow 2.9.0b2 available for testing
>> >>> Pendulum 3 serializes ISO8601 slightly different (missing T) AFAIK. I
>> >> thought someone opened an issue for that (don't have it handy). Maybe
>> it is
>> >> something we should at least mention in the release notes?
>> >>>
>> >>>
>> >>>
>> >>> Sent from my iPhone
>> >>>
>> >>>
>> >>>
>> >>>>> On 28 Mar 2024, at 00:54, Ephraim Anierobi <
>> ephraimanier...@apache.org>
>> >> wrote:
>> >>>>
>> >>>>
>> >>>
>> >>>> Hey fellow Airflowers,
>> >>>
>> >>>>
>> >>>
>> >>>> We have cut Airflow 2.9.0b2 now that all the main features have been
>> >>>
>> >>>> included.
>> >>>
>> >>>>
>> >>>
>> >>>> This "snapshot" is intended for members of the Airflow developer
>> >> community
>> >>>
>> >>>> to test the build
>> >>>
>> >>>> and allow early testing of 2.9.0. Please test this beta and create
>> >> GitHub
>> >>>
>> >>>> issues wherever possible if you encounter bugs, (use 2.9.0b2 in the
>> >> version
>> >>>
>> >>>> dropdown filter when creating the issue).
>> >>>
>> >>>>
>> >>>
>> >>>> For clarity, this is not an official release of Apache Airflow
>> either -
>> >>>
>> >>>

Re: Apache Airflow 2.9.0b2 available for testing

2024-03-29 Thread Jarek Potiuk
Yes. IMHO we should retroactively add a warning to the 2.8.1 release notes
with a comment that you can still downgrade to Pendulum 2 to get the old
behaviour, and then in the 2.9 release notes mention that 3.12 users are
bound to Pendulum 2.

J.


On Fri, Mar 29, 2024 at 10:17 AM Bolke de Bruin  wrote:

> I think it's just nice to give a heads up to users. If you want to use
> python 3.12 you are tied to pendulum 3 and airflow 2.9+. It's our first
> release that supports 3.12 so I think it makes sense to add the note.
>
> B.
>
> Sent from my iPhone
>
> > On 28 Mar 2024, at 17:07, Jarek Potiuk  wrote:
> >
> > 
> >>
> >> It is as we require pendulum 3 for python 3.12 support.
> >
> > Not really. Pendulum 3+ is the only version that works with Python 3.12,
> > yes, But upgrading to Pendulum 3 was not connected to Airflow 2.9.
> >
> > Look it up in our repo. Pendulum <4 has been added in Airflow 2.8.1 and
> > since then we have Pendulum 3 in constraints of Airflow - which also
> means
> > that all Airflow versions for 2.8.1 already use Pendulum 3 when installed
> > via image, unless someone downgrades Pendulum to 2 manually (which is
> > possible for any Python version < 3.12.
> >
> > While possibly we should warn people that Pendulum 3 serialization
> changes
> > for ISO8601, it's not really connected with the Airflow 2.9 release.
> >
> > J.
> >
> >> On Thu, Mar 28, 2024 at 4:47 PM Bolke de Bruin 
> wrote:
> >>
> >> It is as we require pendulum 3 for python 3.12 support.
> >>
> >> Sent from my iPhone
> >>
> >>> On 28 Mar 2024, at 15:21, Nathan Hadfield 
> >> wrote:
> >>>
> >>> Yes, there’s this issue (turned discussion) about it but it is not
> >> specific to 2.9.
> >>>
> >>> https://github.com/apache/airflow/discussions/37037
> >>>
> >>> From: Bolke de Bruin 
> >>> Date: Thursday, 28 March 2024 at 13:01
> >>> To: dev@airflow.apache.org 
> >>> Subject: Re: Apache Airflow 2.9.0b2 available for testing
> >>> Pendulum 3 serializes ISO8601 slightly different (missing T) AFAIK. I
> >> thought someone opened an issue for that (don't have it handy). Maybe
> it is
> >> something we should at least mention in the release notes?
> >>>
> >>>
> >>>
> >>> Sent from my iPhone
> >>>
> >>>
> >>>
> >>>>> On 28 Mar 2024, at 00:54, Ephraim Anierobi <
> ephraimanier...@apache.org>
> >> wrote:
> >>>>
> >>>>
> >>>
> >>>> Hey fellow Airflowers,
> >>>
> >>>>
> >>>
> >>>> We have cut Airflow 2.9.0b2 now that all the main features have been
> >>>
> >>>> included.
> >>>
> >>>>
> >>>
> >>>> This "snapshot" is intended for members of the Airflow developer
> >> community
> >>>
> >>>> to test the build
> >>>
> >>>> and allow early testing of 2.9.0. Please test this beta and create
> >> GitHub
> >>>
> >>>> issues wherever possible if you encounter bugs, (use 2.9.0b2 in the
> >> version
> >>>
> >>>> dropdown filter when creating the issue).
> >>>
> >>>>
> >>>
> >>>> For clarity, this is not an official release of Apache Airflow either
> -
> >>>
> >>>> that doesn't happen until we make a release candidate and then vote on
> >> that.
> >>>
> >>>>
> >>>
> >>>> Airflow 2.9.0b2 is available at:
> >>>
> >>>>
> >>
> >> https://dist.apache.org/repos/dist/dev/airflow/2.9.0b2/

Re: Apache Airflow 2.9.0b2 available for testing

2024-03-28 Thread Jarek Potiuk
> It is as we require pendulum 3 for python 3.12 support.

Not really. Pendulum 3+ is the only version that works with Python 3.12,
yes, but upgrading to Pendulum 3 was not connected to Airflow 2.9.

Look it up in our repo. Pendulum <4 has been added in Airflow 2.8.1, and
since then we have had Pendulum 3 in the constraints of Airflow - which also
means that all Airflow versions since 2.8.1 already use Pendulum 3 when
installed via the image, unless someone downgrades Pendulum to 2 manually
(which is possible for any Python version < 3.12).

While possibly we should warn people that Pendulum 3 serialization changes
for ISO8601, it's not really connected with the Airflow 2.9 release.
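The separator change discussed in this thread can be illustrated with the standard library alone: ISO 8601 uses 'T' between date and time, while a space-separated rendering (which the thread reports as Pendulum 3's new default string form - that behaviour is only described here, not exercised) breaks consumers expecting the 'T':

```python
from datetime import datetime, timezone

dt = datetime(2024, 3, 28, 13, 0, 0, tzinfo=timezone.utc)

# ISO 8601 form with the 'T' separator (what Pendulum 2-era consumers expect):
with_t = dt.isoformat()
# Space-separated form, similar to what the thread reports for Pendulum 3:
without_t = dt.isoformat(sep=" ")

print(with_t)
print(without_t)

# Pinning the separator explicitly, instead of relying on a library's
# default str() rendering, keeps serialization stable across upgrades.
assert datetime.fromisoformat(without_t) == dt
```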

J.

On Thu, Mar 28, 2024 at 4:47 PM Bolke de Bruin  wrote:

> It is as we require pendulum 3 for python 3.12 support.
>
> Sent from my iPhone
>
> > On 28 Mar 2024, at 15:21, Nathan Hadfield 
> wrote:
> >
> > Yes, there’s this issue (turned discussion) about it but it is not
> specific to 2.9.
> >
> > https://github.com/apache/airflow/discussions/37037
> >
> > From: Bolke de Bruin 
> > Date: Thursday, 28 March 2024 at 13:01
> > To: dev@airflow.apache.org 
> > Subject: Re: Apache Airflow 2.9.0b2 available for testing
> > Pendulum 3 serializes ISO8601 slightly different (missing T) AFAIK. I
> thought someone opened an issue for that (don't have it handy). Maybe it is
> something we should at least mention in the release notes?
> >
> >
> >
> > Sent from my iPhone
> >
> >
> >
> >>> On 28 Mar 2024, at 00:54, Ephraim Anierobi 
> wrote:
> >>
> >>
> >
> >> Hey fellow Airflowers,
> >
> >>
> >
> >> We have cut Airflow 2.9.0b2 now that all the main features have been
> >
> >> included.
> >
> >>
> >
> >> This "snapshot" is intended for members of the Airflow developer
> community
> >
> >> to test the build
> >
> >> and allow early testing of 2.9.0. Please test this beta and create
> GitHub
> >
> >> issues wherever possible if you encounter bugs, (use 2.9.0b2 in the
> version
> >
> >> dropdown filter when creating the issue).
> >
> >>
> >
> >> For clarity, this is not an official release of Apache Airflow either -
> >
> >> that doesn't happen until we make a release candidate and then vote on
> that.
> >
> >>
> >
> >> Airflow 2.9.0b2 is available at:
> >
> >>
> https://dist.apache.org/repos/dist/dev/airflow/2.9.0b2/
> >
> >
> >>
> >
> >> *apache-airflow-2.9.0-source.tar.gz* is a source release that comes with
> >
> >> INSTALL instructions.
> >
> >> *apache-airflow-2.9.0.tar.gz* is the binary Python "sdist" release.
> >
> >> *apache_airflow-2.9.0-py3-none-any.whl* is the binary Python wheel
> "binary"
> >
> >> release.
> >
> >>
> >
> >> This snapshot has been pushed to PyPI too at
> >
> >>
> https://pypi.org/project/apache-airflow/2.9.0b2/
> >
> >
> >> and can be installed by running the following command:
> >
> >>
> >
> >> pip install 'apache-airflow==2.9.0b2'
> >
> >>
> >
> >> *Constraints files* are available at
> >
> >>
> https://github.com/apache/airflow/tree/constraints-2.9.0b2
> >
> >
> >>
> >
> >> Release Notes:
> >
> >>
> https://github.com/apache/airflow/blob/2.9.0b2/RELEASE_NOTES.rst
> >
> >
> >>
> >
> >> *Changes since 2.8.4:*
> >
> >>
> >
> >> *Significant Changes*
> >
> >>
> >
> >> *Following Listener API methods are considered stable and can be used
> for
> >

Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-03-27 Thread Jarek Potiuk
seeing this context.

J.



On Wed, Mar 27, 2024 at 10:58 AM Ash Berlin-Taylor  wrote:

> Thanks very much Jarek and folks for working on this proposal.
>
> I think we can ultimately get Airflow better as a result, but there was
> something that wasn’t sitting quite right with me, and I think I’ve finally
> managed to articulate it in to words. (I left this as a comment on the wiki
> AIP page but wanted to bring it back to the list too for increased
> visibility)
>
> I started off feeling nervous about calling this multi-tenancy as my view
> is that term should be reserved for something that has stronger boundaries
> between the tenants. The ambiguity to me is around how strong the boundary
> between tenants is — and what is the mental model users should have around
> how much grief a bad neighbour could cause (intentionally or not).
> This seems like a fairly big change billed as multi-tenancy, but to my mind
> it doesn't fit that name, and if we are only talking about a
> per-team boundary (crucially: the boundary is between groups/teams within
> one company) then could we instead meet these needs more directly and less
> ambiguously by adding Variable and Connection RBAC (I.e. think along the
> lines of team/dag prefix-X has permissions to these Secrets only?)
>
> That is to my mind closer to what I feel the intent of this would be. Yes,
> that would mean conn names (et al) are a global namespace. I think for some
> users it would be a benefit – for example access to a central DWH would
> only need to be managed (and rotated) once, not once per team and then
> access given to specific teams. It also doesn't lead to overloading of MT
> meaning multiple things to multiple people.
>
> This could build on all the same things already mentioned in this AIP
> (multiple executors per team, multiple dag parsers etc) but is a different
> way of approaching the problem.
>
> This then also maintains the fact that some things are still going to be a
> global namespace, specifically users. There is one list of users/one IdP
> configured and users exist in one (or more) teams.
>
> I think this might also be a smaller change overall, as the only
> namespacing we'd need would be on DAG ids and that could be enforced by
> policy (as in something like ", and everything else could be managed by
> RBAC etc. In fact you are in some ways already talking about adding this
> sort of thing for the allow list on who can trigger a dataset. So lets
> formalize that and expand it a bit.
>
> The other advantage of approaching it this way is that it adds features
> that would be useful for people who are happy with the current "shared"
> infrastructure approach (i.e. happy with multiple teams in one Airflow,
> using existing DAG level rbac etc) but would like the ability to add
> Connection restriction without forcing them to (their mind) complicate
> their deployment story.
>
>
> > On 18 Mar 2024, at 04:42, Amogh Desai  wrote:
> >
> > Thanks Jarek and Shubham for the clarifications.
> >
> > Looking forward to this one!
> >
> > Thanks & Regards,
> > Amogh Desai
> >
> > On Sat, Mar 16, 2024 at 10:10 AM Mehta, Shubham
> mailto:shu...@amazon.com.invalid>>
> > wrote:
> >
> >> Jarek - I totally agree. We had a similar conversation in Dec 2022, and
> >> Filip K. from Google suggested [1] calling them "workspaces." But I
> think
> >> most of our users and contributors are used to the word "tenant."
> Trying a
> >> new term like "workspaces" just seems to make things more confusing. For
> >> example, when I tried using it with a couple of developers at AWS, who
> were
> >> somewhat familiar with Airflow, it immediately prompted questions about
> its
> >> relation to "tenants."
> >>
> >> I also liked how you explained it in the AIP response. It's like when
> >> Kubernetes talks about "multi-tenant" [2] and they mean it could be for
> >> different customers or different teams. What we’re doing with Airflow is
> >> for teams, not really different customers. But it's simpler to keep
> calling
> >> it "multi-tenant," just like Kubernetes does, and make sure we explain
> that
> >> we're talking about different teams.
> >>
> >> Reference:
> >> 1.
> >>
> https://docs.google.com/document/d/1n23h26p4_8F5-Cd0JGLPEnF3gumJ5hw3EpwUljz7HcE/edit?disco=lo3bv6Q
> >> 2. https://kubernetes.io/docs/concepts/security/multi-tenancy/
> >>
> >> Shubham
> >>
> >> On 2024-03-15, 2:50 AM, "Jarek Potiuk"  >> ja...@

Re: [VOTE] March 2024 PR of the Month

2024-03-25 Thread Jarek Potiuk
+1  on Python 3.12 (#36755) - with the note that it's been a joint effort
of quite a few people - Andrey, Bolke, Gopal Dirisao who contributed to it
in Airflow, but also a number of people who are maintainers of our
dependencies: Israel Fruchter and Bret McGuire from Cassandra, Andreas
Poehlmann from universal_pathlib, Jessie Whitehouse who figured out the
databricks driver (and general) coverage-induced slow-downs.

I probably forgot someone :).

J.

On Mon, Mar 25, 2024 at 10:51 PM Scheffler Jens (XC-AS/EAE-ADA-T)
 wrote:

> My vote is +1 for #36755 - mainly because it was really a long - runner
> and important we made it!
>
> Sent from Outlook for iOS
> 
> From: Briana Okyere 
> Sent: Monday, March 25, 2024 8:53:32 PM
> To: dev@airflow.apache.org 
> Subject: [VOTE] March 2024 PR of the Month
>
> Hey All,
>
> It’s once again time to vote for the PR of the Month!
>
> With the help of the `get_important_pr_candidates` script in dev/stats,
> we've identified the following candidates:
>
>  PR #37937: refactor: Refactored __new__ magic method of BaseOperatorMeta
> to avoid bad mixing classic and decorated operators <
>
> https://github.com/apache/airflow/pull/37937
> >
>
> PR #37821: Add `ADLSCreateObjectOperator` <
>
> https://github.com/apache/airflow/pull/37821
> >
>
>  PR #37458: Add Yandex Query support from Yandex.Cloud <
>
> https://github.com/apache/airflow/pull/37458
> >
>
> PR #36935: Adding ability to automatically set DAG to off after X times it
> failed sequentially
> <
> https://github.com/apache/airflow/pull/36935
> >
>
>  PR #36755: Python3.12 <
> https://github.com/apache/airflow/pull/36755
> >
>
> Please reply to this thread with your selection or offer your own
> nominee(s).
>
> Voting will close on Friday, March 29th at 10 AM PST. The winner(s) will be
> featured in the next issue of the Airflow newsletter.
>
> Also, if there’s an article or event that you think should be included in
> this or a future issue of the newsletter, please drop me a line at <
> briana.oky...@astronomer.io>
>
> --
> Briana Okyere
> Community Manager
> Astronomer
>


Re: [VOTE] AIP-64: Keep TaskInstance try history

2024-03-25 Thread Jarek Potiuk
+1 (binding)

On Mon, Mar 25, 2024 at 10:51 PM Scheffler Jens (XC-AS/EAE-ADA-T)
 wrote:

> +1 binding
>
> Sent from Outlook for iOS
> 
> From: Pierre Jeambrun 
> Sent: Monday, March 25, 2024 9:19:36 PM
> To: dev@airflow.apache.org 
> Subject: Re: [VOTE] AIP-64: Keep TaskInstance try history
>
> +1 (binding)
>
> On Mon 25 Mar 2024 at 19:32, Igor Kholopov 
> wrote:
>
> > +1 (non-binding)
> >
> > On Mon, Mar 25, 2024 at 7:28 PM Tzu-ping Chung  >
> > wrote:
> >
> > > +1 binding.
> > >
> > > This is something we should just do even outside of the DAG versioning
> > > context (which we also should do, to be clear).
> > >
> > > TP
> > >
> > >
> > > > On Mar 26, 2024, at 02:12, Aritra Basu 
> > wrote:
> > > >
> > > > +1 (non-binding)
> > > > This is quite good, I'd thought in passing that it'd be useful to
> have.
> > > >
> > > > --
> > > > Regards,
> > > > Aritra Basu
> > > >
> > > > On Mon, Mar 25, 2024, 10:46 PM Jed Cunningham <
> > jedcunning...@apache.org>
> > > > wrote:
> > > >
> > > >> Hello Airflow Community,
> > > >>
> > > >> I would like to start a vote on AIP-64: Keep TaskInstance try
> history.
> > > >>
> > > >> You can find the AIP here:
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-64%3A+Keep+TaskInstance+try+history
> > > >>
> > > >> Discussion Thread:
> > > >>
> https://lists.apache.org/thread/vvm43tfchyo92hmf40fqvmq0f5845bjr
> > > >>
> > > >> This is the first step in the AIP-63 DAG Versioning journey, though
> > this
> > > >> provides value in isolation as well:
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-63%3A+DAG+Versioning
> > > >>
> > > >> The vote will last until 2024-03-28 17:30 UTC and until at least 3
> > > binding
> > > >> votes have been cast.
> > > >>
> > > >> Consider this my binding +1.
> > > >>
> > > >> Please vote accordingly:
> > > >>
> > > >> [ ] + 1 approve
> > > >> [ ] + 0 no opinion
> > > >> [ ] - 1 disapprove with the reason
> > > >>
> > > >> Only votes from PMC members and committers are binding, but other
> > > members
> > > >> of the community are encouraged to check the AIP and vote with
> > > >> "(non-binding)".
> > > >>
> > > >> Thanks,
> > > >> Jed
> > > >>
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> > > For additional commands, e-mail: dev-h...@airflow.apache.org
> > >
> > >
> >
>


Re: [VOTE] Airflow Providers prepared on March 25, 2024

2024-03-25 Thread Jarek Potiuk
+1 (binding): checked signatures, checksums, licences, reproducibility.

I built the latest main `airflow` (that will become b2) as a `2.9.0rc1` package
and installed it together with apache-airflow-providers-fab `1.0.2rc1` and
they work nicely together. I was able to create users, authenticate, login
to webserver, run a few example dags. All looks good.
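The checks Jarek lists (signatures, checksums, reproducibility) follow the release-verification guide linked below in the thread. As a hedged illustration of just the checksum step, with a throwaway file standing in for the real RC artifact from dist.apache.org (signature checking with gpg is a separate step covered by the guide):

```python
# Illustration only: a dummy file stands in for the real RC artifact.
import hashlib
import tempfile
from pathlib import Path

def sha512_of(path: Path) -> str:
    # Compute the SHA-512 digest of the artifact's bytes.
    return hashlib.sha512(path.read_bytes()).hexdigest()

workdir = Path(tempfile.mkdtemp())
artifact = workdir / "apache-airflow-providers-fab-1.0.2rc1.tar.gz"
artifact.write_bytes(b"dummy artifact bytes")

# Release managers publish "<artifact>.sha512" next to each file; recreate one.
checksum_file = workdir / (artifact.name + ".sha512")
checksum_file.write_text(f"{sha512_of(artifact)}  {artifact.name}\n")

# The verification step: recompute locally and compare to the published value.
published = checksum_file.read_text().split()[0]
assert sha512_of(artifact) == published, "checksum mismatch"
print("checksum OK")
```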

On Mon, Mar 25, 2024 at 3:00 PM Elad Kalif  wrote:

> Hey all,
>
> I have just cut the new wave Airflow Providers packages. This email is
> calling a vote on the release,
> which will last for 72 hours - which means that it will end on March 28,
> 2024 13:55 PM UTC and until 3 binding +1 votes have been received.
>
>
> Consider this my (binding) +1.
>
> This release is only for FAB provider to be tested along with Airflow 2.9.0
> rc11
>
>
> Airflow Providers are available at:
> https://dist.apache.org/repos/dist/dev/airflow/providers/
>
> *apache-airflow-providers--*.tar.gz* are the binary
>  Python "sdist" release - they are also official "sources" for the provider
> packages.
>
> *apache_airflow_providers_-*.whl are the binary
>  Python "wheel" release.
>
> The test procedure for PMC members is described in
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
>
> The test procedure for Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
>
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> Please note that the version number excludes the 'rcX' string.
> This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> The status of testing the providers by the community is skipped because we
> test the fab provider as a whole rather than specific commits.
>
> The issue is also the easiest way to see important PRs included in the RC
> candidates.
> Detailed changelog for the providers will be published in the documentation
> after the
> RC candidates are released.
>
> You can find the RC packages in PyPI following these links:
>
> https://pypi.org/project/apache-airflow-providers-fab/1.0.2rc1/
>
> Cheers,
> Elad Kalif
>


Re: [DISCUSS] Applying D105 rule for our codebase ("undocumented magic methods") ?

2024-03-25 Thread Jarek Potiuk
PR to remove it https://github.com/apache/airflow/pull/38452

On Mon, Mar 25, 2024 at 10:14 AM Jarek Potiuk  wrote:

> Ok. Seems we have a consensus :)
>
> On Mon, Mar 25, 2024 at 7:13 AM Wei Lee  wrote:
>
>> +1 for removing this rule. If docstring is needed for some cases, we can
>> still do that or add comments to PRs.
>>
>> Best,
>> Wei
>>
>> > On Mar 25, 2024, at 1:31 PM, Amogh Desai 
>> wrote:
>> >
>> > After reading the emails in this thread, I too think that this rule is
>> not
>> > generally very useful, but we should have
>> > it for cases which are special (where we are overriding some magic
>> method
>> > to do something different or more
>> > than its reserved meaning)
>> >
>> > +1 for removal of the rule, with special cases being handled separately.
>> >
>> > Thanks & Best Regards,
>> > Amogh Desai
>> >
>> > On Wed, Mar 20, 2024 at 11:46 PM Oliveira, Niko
>> 
>> > wrote:
>> >
>> >> I'm -1 to enabling D105
>> >>
>> >>
>> >> I don't think it will lead to helpful documentation. I think for the
>> rare
>> >> cases it is required it can be left up to the developer or caught in PR
>> review.
>> >>
>> >> Cheers,
>> >> Niko
>> >>
>> >> 
>> >> From: Vincent Beck 
>> >> Sent: Wednesday, March 20, 2024 5:51:43 AM
>> >> To: dev@airflow.apache.org
>> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Applying D105 rule
>> >> for our codebase ("undocumented magic methods") ?
>> >>
>> >> CAUTION: This email originated from outside of the organization. Do not
>> >> click links or open attachments unless you can confirm the sender and
>> know
>> >> the content is safe.
>> >>
>> >>
>> >>
>> >> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur
>> externe.
>> >> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne
>> pouvez
>> >> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain
>> que
>> >> le contenu ne présente aucun risque.
>> >>
>> >>
>> >>
>> >> +1 for not enforcing as well. Let's leave to maintainers the
>> flexibility
>> >> to choose whether a given method should be documented.
>> >>
>> >> On 2024/03/20 08:28:51 Ash Berlin-Taylor wrote:
>> >>> I'm for not enforcing this rule - as others have said it's very
>> unlikely
>> >> to result in more useful docs for developers or end users.
>> >>>
>> >>> -asg
>> >>>
>> >>> On 20 March 2024 08:12:40 GMT, Andrey Anshin <
>> andrey.ans...@taragol.is>
>> >> wrote:
>> >>>> ±0 from my side
>> >>>>
>> >>>> Maybe we have to review all current methods which do not follow this
>> >> rule
>> >>>> to find a really useful meaning, and do not enforce (disable it).
>> >>>> So to avoid unnecessary changes we might close
>> >>>> https://github.com/apache/airflow/issues/37523 and remove/mark
>> >> completed
>> >>>> into the https://github.com/apache/airflow/issues/10742
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Wed, 20 Mar 2024 at 10:41, Pankaj Koti > >> .invalid>
>> >>>> wrote:
>> >>>>
>> >>>>> +1 to what Aritra is saying.
>> >>>>>
>> >>>>>
>> >>>>> Best regards,
>> >>>>>
>> >>>>> *Pankaj Koti*
>> >>>>> Senior Software Engineer (Airflow OSS Engineering team)
>> >>>>> Location: Pune, Maharashtra, India
>> >>>>> Timezone: Indian Standard Time (IST)
>> >>>>> Phone: +91 9730079985
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 20, 2024 at 12:05 PM Aritra Basu <
>> >> aritrabasu1...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I'm in general not a huge fan of documenting for the sake of
>> >> documenting,
>> >>>

Re: [DISCUSS] Applying D105 rule for our codebase ("undocumented magic methods") ?

2024-03-25 Thread Jarek Potiuk
Ok. Seems we have a consensus :)

On Mon, Mar 25, 2024 at 7:13 AM Wei Lee  wrote:

> +1 for removing this rule. If docstring is needed for some cases, we can
> still do that or add comments to PRs.
>
> Best,
> Wei
>
> > On Mar 25, 2024, at 1:31 PM, Amogh Desai 
> wrote:
> >
> > After reading the emails in this thread, I too think that this rule is
> not
> > generally very useful, but we should have
> > it for cases which are special (where we are overriding some magic method
> > to do something different or more
> > than its reserved meaning)
> >
> > +1 for removal of the rule, with special cases being handled separately.
> >
> > Thanks & Best Regards,
> > Amogh Desai
> >
> > On Wed, Mar 20, 2024 at 11:46 PM Oliveira, Niko
> 
> > wrote:
> >
> >> I'm -1 to enabling D105
> >>
> >>
> >> I don't think it will lead to helpful documentation. I think for the
> rare
> >> cases it is required it can be left up to the developer or caught in PR
> review.
> >>
> >> Cheers,
> >> Niko
> >>
> >> 
> >> From: Vincent Beck 
> >> Sent: Wednesday, March 20, 2024 5:51:43 AM
> >> To: dev@airflow.apache.org
> >> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Applying D105 rule
> >> for our codebase ("undocumented magic methods") ?
> >>
> >> CAUTION: This email originated from outside of the organization. Do not
> >> click links or open attachments unless you can confirm the sender and
> know
> >> the content is safe.
> >>
> >>
> >>
> >> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur
> externe.
> >> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne
> pouvez
> >> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain
> que
> >> le contenu ne présente aucun risque.
> >>
> >>
> >>
> >> +1 for not enforcing as well. Let's leave to maintainers the flexibility
> >> to choose whether a given method should be documented.
> >>
> >> On 2024/03/20 08:28:51 Ash Berlin-Taylor wrote:
> >>> I'm for not enforcing this rule - as others have said it's very unlikely
> >> to result in more useful docs for developers or end users.
> >>>
> >>> -asg
> >>>
> >>> On 20 March 2024 08:12:40 GMT, Andrey Anshin  >
> >> wrote:
> >>>> ±0 from my side
> >>>>
> >>>> Maybe we have to review all current methods which do not follow this
> >> rule
> >>>> to find a really useful meaning, and do not enforce (disable it).
> >>>> So to avoid unnecessary changes we might close
> >>>> https://github.com/apache/airflow/issues/37523 and remove/mark
> >> completed
> >>>> into the https://github.com/apache/airflow/issues/10742
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Wed, 20 Mar 2024 at 10:41, Pankaj Koti  >> .invalid>
> >>>> wrote:
> >>>>
> >>>>> +1 to what Aritra is saying.
> >>>>>
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> *Pankaj Koti*
> >>>>> Senior Software Engineer (Airflow OSS Engineering team)
> >>>>> Location: Pune, Maharashtra, India
> >>>>> Timezone: Indian Standard Time (IST)
> >>>>> Phone: +91 9730079985
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 20, 2024 at 12:05 PM Aritra Basu <
> >> aritrabasu1...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I'm in general not a huge fan of documenting for the sake of
> >> documenting,
> >>>>>> so I'd be in agreement of not enforcing it via code but rather be
> >>>>> enforced
> >>>>>> by the reviewers in cases they believe certain methods need
> >> documenting.
> >>>>>>
> >>>>>> --
> >>>>>> Regards,
> >>>>>> Aritra Basu
> >>>>>>
> >>>>>> On Wed, Mar 20, 2024, 9:39 AM Jarek Potiuk 
> >> wrote:
> >>>>>>
> >>>>>>> Hey here,
> >>>>>>>
> >>>>>>> I w

Re: [VOTE] Release Apache Airflow Helm Chart 1.13.1 based on 1.13.1rc1

2024-03-22 Thread Jarek Potiuk
+1 binding. Verified reproducibility, signatures, licences, checksums.
Installed it locally and got Airflow working.

Though - one comment. While doing it now, the prometheus exporter image has
not been pulled yet and the exporter is failing. The problem is that `
quay.io` is going through an outage currently it seems.

It's a Red Hat service, so I presume it's a temporary blip. I am not sure
how reliable it is, but the outage seems to have been going on for well over 30
minutes for me at least and does not show any sign of recovery, so maybe we
should consider switching to dockerhub for all our images in the future -
at least we will rely on a single service, not two of those. It seems quay
is far less reliable than dockerhub:

[Status-page screenshots omitted: Quay uptime over the last 30 days and the
last 24 hours, compared with Docker Hub uptime over the same period.]


J.


On Thu, Mar 21, 2024 at 7:24 PM Jed Cunningham 
wrote:

> Hello Apache Airflow Community,
>
> This is a call for the vote to release Helm Chart version 1.13.1.
>
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/airflow/helm-chart/1.13.1rc1/
>
> airflow-chart-1.13.1-source.tar.gz - is the "main source release" that
> comes with INSTALL instructions.
> airflow-1.13.1.tgz - is the binary Helm Chart release.
>
> Public keys are available at: https://www.apache.org/dist/airflow/KEYS
>
> For convenience "index.yaml" has been uploaded (though excluded from
> voting), so you can also run the below commands.
>
> helm repo add apache-airflow-dev
> https://dist.apache.org/repos/dist/dev/airflow/helm-chart/1.13.1rc1/
> helm repo update
> helm install airflow apache-airflow-dev/airflow
>
> airflow-1.13.1.tgz.prov - is also uploaded for verifying Chart Integrity,
> though not strictly required for releasing the artifact based on ASF
> Guidelines.
>
> $ helm gpg verify airflow-1.13.1.tgz
> gpg: Signature made Thu Mar 21 12:13:02 2024 MDT
> gpg: using RSA key E1A1E984F55B8F280BD9CBA20BB7163892A2E48E
> gpg: issuer "jedcunning...@apache.org"
> gpg: Good signature from "Jed Cunningham "
> [ultimate]
> plugin: Chart SHA verified.
> sha256:ed6b2dea0d8f99eb9bd9cd6bc418db95f88a7bb3b8d1afb7fdc266b1ea411a15
>
> The vote will be open for at least 72 hours (2024-03-24 18:29 UTC) or until
> the necessary number of votes is reached.
>
>
> https://www.timeanddate.com/countdown/to?iso=20240324T1829=136=cursive
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> Consider this my (binding) +1.
>
> For license checks, the .rat-excludes files is included, so you can run the
> following to verify licenses (just update your path to rat):
>
> tar -xvf airflow-chart-1.13.1-source.tar.gz
> cd airflow-chart-1.13.1
> java -jar apache-rat-0.13.jar chart -E .rat-excludes
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 1.13.1. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release it.
>
> The status of testing the Helm Chart by the community is kept here:
> https://github.com/apache/airflow/issues/38382
>
> Thanks,
> Jed
>


Re: Apache Airflow 2.9.0b1 available for testing

2024-03-21 Thread Jarek Potiuk
Just wanted to stress a note for this one:

*apache-airflow-providers-fab==1.0.2b0 must be installed alongside*
*this snapshot for Airflow to work.*

(we will improve it in the next release)
J.


On Thu, Mar 21, 2024 at 1:12 PM Ephraim Anierobi 
wrote:

> Hey fellow Airflowers,
>
> We have cut Airflow 2.9.0b1 now that all the main features have been
> included.
>
> Note: apache-airflow-providers-fab==1.0.2b0 must be installed alongside
> this snapshot for Airflow to work.
>
> This "snapshot" is intended for members of the Airflow developer community
> to test the build
> and allow early testing of 2.9.0. Please test this beta and create GitHub
> issues wherever possible if you encounter bugs, (use 2.9.0b1 in the version
> dropdown filter when creating the issue).
>
> For clarity, this is not an official release of Apache Airflow either -
> that doesn't happen until we make a release candidate and then vote on
> that.
>
> Airflow 2.9.0b1 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/2.9.0b1/
>
> *apache-airflow-2.9.0-source.tar.gz* is a source release that comes with
> INSTALL instructions.
> *apache-airflow-2.9.0.tar.gz* is the binary Python "sdist" release.
> *apache_airflow-2.9.0-py3-none-any.whl* is the binary Python wheel "binary"
> release.
>
> This snapshot has been pushed to PyPI too at
> https://pypi.org/project/apache-airflow/2.9.0b1/
> and can be installed by running the following command:
>
> pip install 'apache-airflow==2.9.0b1'
>
> *Constraints files* are available at
> https://github.com/apache/airflow/tree/constraints-2.9.0b1
>
> Release Notes:
> https://github.com/apache/airflow/blob/2.9.0b1/RELEASE_NOTES.rst
>
> *Changes since 2.8.4rc1:*
>
> *Significant Changes*
>
>
> *The following Listener API methods are considered stable and can be used for
> production systems (they were an experimental feature in older Airflow
> versions) (#36376):*
> Lifecycle events:
>
> - ``on_starting``
> - ``before_stopping``
>
> DagRun State Change Events:
>
> - ``on_dag_run_running``
> - ``on_dag_run_success``
> - ``on_dag_run_failed``
>
> TaskInstance State Change Events:
>
> - ``on_task_instance_running``
> - ``on_task_instance_success``
> - ``on_task_instance_failed``
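A hedged sketch of what a listener implementing the hooks named above could look like. In a real Airflow plugin each function would be decorated with `@hookimpl` from `airflow.listeners` and exposed through a plugin's `listeners` attribute; plain functions and a hand-rolled dispatch are used here so the sketch runs without Airflow installed, and the signatures follow the 2.x listener spec as I understand it (verify against your version):

```python
# listeners.py - illustrative only, not real Airflow plugin code.
events = []

def on_dag_run_success(dag_run, msg):
    # Record that a DAG run finished successfully.
    events.append(("dag_run_success", dag_run))

def on_task_instance_failed(previous_state, task_instance, session):
    # Record a task failure, e.g. to forward it to an alerting system.
    events.append(("task_instance_failed", task_instance))

# Stand-in for Airflow's pluggy-based listener manager firing the hooks:
on_dag_run_success("my_dag @ 2024-03-21", msg="run finished")
on_task_instance_failed(None, "my_dag.extract", session=None)

assert [name for name, _ in events] == ["dag_run_success", "task_instance_failed"]
```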
>
> *Support for Microsoft SQL-Server for Airflow Meta Database has been
> removed (#36514)*
>
> After discussion
>  and a
> voting
> process  >,
> the Airflow's PMC and Committers have reached a resolution to no longer
> maintain MsSQL as a supported Database Backend.
>
> As of Airflow 2.9.0 support of MsSQL has been removed for Airflow Database
> Backend.
>
> A migration script which can help migrating the database *before* upgrading
> to Airflow 2.9.0 is available in airflow-mssql-migration repo on Github
> .
>
> Note that the migration script is provided without support and warranty.
>
> This does not affect the existing provider packages (operators and hooks),
> DAGs can still access and process data from MsSQL.
>
> *Dataset URIs are now validated on input (#37005)*
>
> Datasets must use a URI that conform to rules laid down in AIP-60, and the
> value
> will be automatically normalized when the DAG file is parsed. See
> `documentation on Datasets <
>
> https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/datasets.html
> >`_
> for
> a more detailed description on the rules.
>
> You may need to change your Dataset identifiers if they look like a URI,
> but are
> used in a less mainstream way, such as relying on the URI's auth section,
> or
> have a case-sensitive protocol name.
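As a rough stdlib illustration of the kind of normalization described here (scheme and host case-folding; the actual AIP-60 rules live in the linked docs, and `normalize_dataset_uri` is a made-up helper, not Airflow's implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_dataset_uri(uri: str) -> str:
    # Scheme and authority are case-insensitive per RFC 3986; the path is not.
    parts = urlsplit(uri)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )

# A URI with a "case-sensitive protocol name" gets folded to lowercase:
assert normalize_dataset_uri("S3://Bucket/Key") == "s3://bucket/Key"
```

This is why identifiers that merely look like URIs but rely on, say, an upper-case protocol name may need to change.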
>
> *The method ``get_permitted_menu_items`` in ``BaseAuthManager`` has been
> renamed ``filter_permitted_menu_items`` (#37627)*
>
> *Add REST API actions to Audit Log events (#37734)*
>
> The Audit Log ``event`` name for REST API events will be prepended with
> ``api.`` or ``ui.``, depending on if it came from the Airflow UI or
> externally.
>
> *Airflow 2.9.0 is the first release that officially supports Python 3.12
> (#38025)*
> There are a few caveats though:
>
> * Pendulum2 does not support Python 3.12. For Python 3.12 you need to use
>   `Pendulum 3 <
> https://pendulum.eustace.io/blog/announcing-pendulum-3-0-0.html>`_
>
> * Minimum SQLAlchemy version supported when Pandas is installed for Python
> 3.12 is ``1.4.36`` released in
>   April 2022. Airflow 2.9.0 increases the minimum supported version of
> SQLAlchemy to ``1.4.36`` for all
>   Python versions.
>
> Not all Providers support Python 3.12. At the initial release of Airflow
> 2.9.0 the following providers
> are released without support for Python 3.12:
>
>   * ``apache.beam`` - pending on `Apache Beam support for 3.12 <
> https://github.com/apache/beam/issues/29149>`_
>   * ``papermill`` - pending on Releasing Python 3.12 compatible papermill
> client version
> `including this merged 

Re: [VOTE] Release Airflow 2.8.4 from 2.8.4rc1

2024-03-20 Thread Jarek Potiuk
+1 (binding) - tested / verified all changes I was involved in (either as
fixer, bug introducer or both, particularly when both), verified
reproducibility, licences, checksums, signatures, run a few DAGs -  all
looks good.

On Wed, Mar 20, 2024 at 4:56 PM Jed Cunningham 
wrote:

> Hey fellow Airflowers,
>
> I have cut Airflow 2.8.4rc1. This email is calling a vote on the release,
> which will last at least 72 hours, from Wednesday, March 20, 2024 at 4:00
> pm UTC
> until Saturday, March 23, 2024 at 4:00 pm UTC, and until 3 binding +1 votes
> have been received.
>
>
> https://www.timeanddate.com/worldclock/fixedtime.html/?msg=8=20240323T1600=1440
>
> Status of testing of the release is kept in
> https://github.com/apache/airflow/issues/38334
>
> Consider this my (binding) +1.
>
> Airflow 2.8.4rc1 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/2.8.4rc1/
>
> *apache-airflow-2.8.4-source.tar.gz* is a source release that comes with
> INSTALL instructions.
> *apache-airflow-2.8.4.tar.gz* is the binary Python "sdist" release.
> *apache_airflow-2.8.4-py3-none-any.whl* is the binary Python wheel "binary"
> release.
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The test procedure for PMC members is described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#verify-the-release-candidate-by-pmc-members
>
> The test procedure for Contributors who would like to test this RC is
> described in:
>
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#verify-the-release-candidate-by-contributors
>
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 2.8.4. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> Release Notes:
> https://github.com/apache/airflow/blob/2.8.4rc1/RELEASE_NOTES.rst
>
> For information on what goes into a release please see:
>
> https://github.com/apache/airflow/blob/main/dev/WHAT_GOES_INTO_THE_NEXT_RELEASE.md
>
> Changes since 2.8.3:
> *Bugs*:
> - Fix incorrect serialization of ``FixedTimezone`` (#38139)
> - Fix excessive permission changing for log task handler (#38164)
> - Fix task instances list link (#38096)
> - Fix a bug where scheduler heartrate parameter was not used (#37992)
> - Add padding to prevent grid horizontal scroll overlapping tasks (#37942)
> - Fix hash caching in ``ObjectStoragePath`` (#37769)
>
> *Miscellaneous*:
> - Limit importlib_resources as it breaks ``pytest_rewrites`` (#38095,
> #38139)
> - Limit ``pandas`` to ``<2.2`` (#37748)
> - Bump ``croniter`` to fix an issue with 29 Feb cron expressions (#38198)
>
> *Doc-only Change*:
> - Tell users what to do if their scanners find issues in the image (#37652)
> - Add a section about debugging in Docker Compose with PyCharm (#37940)
> - Update deferrable docs to clarify kwargs when trigger resumes operator
> (#38122)
>
> Thanks,
> Jed
>


Re: Bad mixing of decorated and classic operators (users shooting themselves in their foot)

2024-03-20 Thread Jarek Potiuk
I am really torn on that one to be honest.

I am OK with the error (with the note that it will likely break a lot of
workflows), and I am OK with the warning as well (as a softer way of letting the
user know they are doing it wrong).
But ultimately, I'd really want we (re) consider if we cannot make it into
a "working" solution.

It should be possible IMHO to allow this usage with clever instrumentation
- the code that we have now in Dave's PR is already detecting most (all
that Elad mentioned as well including regular operators within regular
operators I think Dave?) cases like that.

It's really **only** a matter of a) setting task_id to be the parent's task
id, b) getting context from the parent operator, and c) running pre-processing
of templated fields (and in this case it can be done in the constructor of
such a nested operator - because we already have context).

This all seems doable from the technical point of view.

We do not even have to handle pushing the return value to Xcom (that would
be tricky with potentially returning value from the "upper" operator - but
we can add a warning about that one if it happens and such "nested"
operator has do_xcom_push and returns value from execute) and it should
just run "as expected".

I can't think of a big harm done this way to be honest - and it would make
the life of our users way easier (and also the lives of those who happen to look
at the issues and discussion and attempt to help the users - because those
kinds of discussions and issues would not appear at all if this case will
**just work** (TM)).

I think this is the case where "perceived correctness" trumps "harm done".
Unless of course I am missing some side effect here - which might well be I
miss - but no-one pointed to an actual harm it can do.

And to Elad's point: "I know there is an operator that does X, so I will
just use it inside the python function I invoke from the python.operator".

Yes, many of our users might think this way. And I acknowledge that in this
case (provided no harm is done) I prefer to adapt to the way my user
thinks, rather than force on them things that I think are the "ONLY" right
way of doing things. If it makes Airflow easier to use, and increases the
perception that it's software written for imperfect humans who do not
always have time to read the documentation in detail and who implement
things in the way that feels natural to them - I am all for it, voting with
all my hands and legs. This means better Airflow adoption, better
word-of-mouth from our users, and most of all - our users losing as little
time as possible on unnecessary overhead (and us on responding to users'
worries).

J.

On Wed, Mar 20, 2024 at 9:37 AM Ash Berlin-Taylor  wrote:

> The reason users are sure they can use operators like that is that it has
> worked for a long time - hell I even wrote a custom nested operator in the
> past (pre 2.0 admittedly).
>
> So this pr should only be a warning by default, or a config option to warn
> but not error
>
> Alternatively do we just document it as "this is not recommended, and if
> it breaks you get to keep both halves"? (I'm not a fan of the runtime
> enforcement of that, it seems very "heavy" for minimal benefit or
> protection and limits what adventurous users can do)
>
> -a
>
> On 20 March 2024 07:05:25 GMT, Jarek Potiuk  wrote:
> >Just to add to the discussion - a discussion raised today
> >https://github.com/apache/airflow/discussions/38311  where the user is
> sure
> >that they can use operators in such a way as described above, and even
> used
> >the term "nested operator".
> >
> >I think getting  https://github.com/apache/airflow/pull/37937  in will
> be a
> >good way in the future to prevent this misunderstanding, but maybe there
> is
> >something to think about - in the "Operators need to die" context by
> Bolke.
> >
> >BTW. I have a hypothesis why those questions started to appear frequently
> >and people being reasonably sure they can do it. It's a pure speculation
> >(and I asked the user this time to explain) but some of that might be
> >fuelled by Chat GPT hallucinating about Airflow being able to do it. I saw
> >similar hallucinations before - where people suggested some (completely
> >wrong like that) solution to their problem and only after inquiry, they
> >admitted that it was a solution that ChatGPT gave them
> >
> >I wonder if we have even more of those soon.
> >
> >J.
> >
> >
> >On Sun, Mar 10, 2024 at 9:29 AM Elad Kalif  wrote:
> >
> >> The issue here is not just about decorators it happens also with regular
> >> operators (operator inside operator) and also with operator inside
> >> on_x_callb

Re: Python 3.12 support is here (!)

2024-03-20 Thread Jarek Potiuk
FYI. Python 3.12 is now fully back  - ready for 2.9.0.

We even got some of the providers that were excluded, back in:

* databricks is back in after the issue we had was traced to the Python
"coverage" tool not yet working well for Python 3.12 (we disabled coverage
for Python 3.12 for now)
* cassandra is about to be re-enabled in
https://github.com/apache/airflow/pull/38314 after they responded to our
issue and released a binary driver today with libev support compiled in.

The two remaining providers that do not work for Python 3.12 are papermill
and apache-beam:
* Papermill - technically, support is merged but we are waiting for a new
release of their package https://github.com/nteract/papermill/pull/771 - I
pinged them today and there is some reaction, let's see.
* Beam (traditionally they have the most complex dependencies and are
lagging behind) - the issue is here, and I asked them when support is
planned: https://github.com/apache/beam/issues/29149

Just as a reminder: Airflow 2.9.0 default images will be *Python 3.12*  -
according to the rule we agreed here:
https://lists.apache.org/thread/0oxnvct24xlqsj76z42w2ttw2d043oy3 - this
means that papermill and beam will not be installable in "apache/airflow"
image or "apache/airflow:2.9.0" - users will have to use
"apache/airflow:2.9.0-python3.11" for example.
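
For deployments that still need those providers on 2.9.0, a sketch of a
custom image pinned to a Python 3.11 tag (the tag follows the convention
above; whether you need the papermill provider is deployment-specific):

```dockerfile
# Use the Python 3.11 variant of the 2.9.0 image, since the papermill and
# apache-beam providers are not installable on the default (3.12) image.
FROM apache/airflow:2.9.0-python3.11

# Install the extra provider(s) your DAGs need.
RUN pip install --no-cache-dir apache-airflow-providers-papermill
```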

J.


On Mon, Mar 11, 2024 at 10:02 AM Jarek Potiuk  wrote:

> There were strange errors in the canary build which were not showing up
> before - so I need to revert and investigate.
>
> On Mon, Mar 11, 2024 at 8:52 AM Jarek Potiuk  wrote:
>
>> And merged :)
>>
>> On Sun, Mar 10, 2024 at 6:52 PM Jarek Potiuk  wrote:
>>
>>> Hey here,
>>>
>>> FINALLY - after 5 months (a little more than I initially anticipated - I
>>> thought it would take 3-4 months) we can finally add Python 3.12.
>>>
>>> The time is about right - the day we plan to cut the
>>> Airflow `v2-9-test` branch!!
>>>
>>> Thanks to Bolke for the final push on the Universal Pathlib migration
>>> (and to Andreas Poehlmann, the Universal Pathlib maintainer). That was
>>> the final blocker that kept us from adding support for 3.12.
>>>
>>> The PR is all "GREEN" https://github.com/apache/airflow/pull/36755 and
>>> waits for reviews :)
>>>
>>> ---
>>>
>>> A bit more details on 3.12 support:
>>>
>>>
>>> While Airflow fully supports 3.12, we had to exclude 3 providers from
>>> Python 3.12 support (temporarily - all of them will be included back when
>>> they support 3.12):
>>>
>>> * *apache.beam* (beam has no 3.12 support yet) and looking at the state
>>> of the ticket here we will wait a bit more
>>> https://github.com/apache/beam/issues/29149
>>>
>>> * *apache.cassandra* (the default setup for cassandra does not work
>>> with 3.12 and requires custom compilation) - they are working on releasing
>>> a build that works (the problem is their binary driver does not have the
>>> right library compiled in). Should be fixed in the next cassandra-driver
>>> release (3.30.0). Either because they fix their build environment to
>>> compile the libev support in
>>> https://datastax-oss.atlassian.net/browse/PYTHON-1378 or when they
>>> promote asyncio reactor to be "production ready":
>>> https://datastax-oss.atlassian.net/browse/PYTHON-1375
>>>
>>> * *papermill* (not sure if they will release a new version with 3.12
>>> support any time soon). The fix is already merged
>>> https://github.com/nteract/papermill/pull/771 - but the last release of
>>> papermill happened in November, and there is not much activity in the
>>> project.
>>>
>>> All the other providers seem to happily work in the Python 3.12
>>> environment.
>>>
>>> J.
>>>
>>>


Re: Bad mixing of decorated and classic operators (users shooting themselves in their foot)

2024-03-20 Thread Jarek Potiuk
Just to add to the discussion - a discussion raised today
https://github.com/apache/airflow/discussions/38311  where the user is sure
that they can use operators in such a way as described above, and even used
the term "nested operator".

I think getting  https://github.com/apache/airflow/pull/37937  in will be a
good way in the future to prevent this misunderstanding, but maybe there is
something to think about - in the "Operators need to die" context by Bolke.

BTW. I have a hypothesis why those questions started to appear frequently
and people being reasonably sure they can do it. It's a pure speculation
(and I asked the user this time to explain) but some of that might be
fuelled by Chat GPT hallucinating about Airflow being able to do it. I saw
similar hallucinations before - where people suggested some (completely
wrong like that) solution to their problem and only after inquiry, they
admitted that it was a solution that ChatGPT gave them

I wonder if we have even more of those soon.

J.


On Sun, Mar 10, 2024 at 9:29 AM Elad Kalif  wrote:

> The issue here is not just about decorators it happens also with regular
> operators (operator inside operator) and also with operator inside
> on_x_callback
>
> For example:
>
> https://stackoverflow.com/questions/64291042/airflow-call-a-operator-inside-a-function/
>
> https://stackoverflow.com/questions/67483542/airflow-pythonoperator-inside-pythonoperator/
>
>
>
> > I can't see which problem is solved by allowing running one operator
> inside another.
>
> From the user's perspective, they have an operator that knows how to do
> something and it's very easy to use. So they want to leverage that.
> For example send Slack message:
>
> slack_operator_post_text = SlackAPIPostOperator(
>     task_id="slack_post_text",
>     channel=SLACK_CHANNEL,
>     text="My message",
> )
>
> It handles everything. Now if you want to send a Slack message from a
> PythonOperator you need to initialize a hook, find the right function to
> invoke etc.
> Thus from the user's perspective - there is already a class that does all
> that. Why can't it just work? Why do they need to "reimplement" the
> operator logic? (Most of the time it will be copy-pasting the logic of the
> execute function.)
>
> So, the problem they are trying to solve is to avoid code duplication and
> ease of use.
>
> Jarek - I think your solution focuses more on the templating side but I
> think the actual problem is not limited to the templating.
> I think the problem is more of "I know there is an operator that does X, so
> I will just use it inside the python function I invoke from the python
> operator" - regardless of whether Jinja/templating becomes an issue or not.
>
> On Sat, Mar 9, 2024 at 9:06 PM Jarek Potiuk  wrote:
>
> > I see that we have already (thanks David!) a PR:
> > https://github.com/apache/airflow/pull/37937 to forbid this use (which
> is
> > cool and I am glad my discussion had some ripple effect :D ).
> >
> > I am quite happy to get this one merged once it passes tests/reviews,
> but I
> > would still want to explore future departure / options we might have,
> maybe
> > there will be another - long term - ripple effect :). I thought a bit
> more
> > about  - possibly - different reasons why this pattern we observe is
> > emerging and I have a theory.
> >
> > To Andrey's comments:
> >
> > > I can't see which problem is solved by allowing running one operator
> > inside another.
> >
> > For me, the main problem to solve is that using Hooks in the way I
> > described in
> >
> >
> https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2
> > in 2022 is almost non-discoverable by a significant percentage of users.
> > Especially those kinds of users that mostly treat Airflow Operators as
> > black-box and **just** discovered task flow as a way that they can do
> > simple things in Python - but they are not into writing their own custom
> > operators, nor look at the operator's code. Generally they don't really
> see
> > DAG authoring as writing Python Code, it's mostly about using a little
> > weird DSL to build their DAGs. Mostly copy some constructs that
> > look like putting together existing building blocks and using patterns
> like
> > `>>` to add dependencies.
> >
> > Yes I try to be empathetic and try to guess how such users think about
> DAG
> > authoring - I might be wrong, but this is what I see as a recurring
> > pattern.
> >
> > So in this context - @task is not Python code writing, it's yet another
> DSL
> > t

[DISCUSS] Applying D105 rule for our codebase ("undocumented magic methods") ?

2024-03-19 Thread Jarek Potiuk
Hey here,

I wanted to quickly poll what people think about applying the
https://docs.astral.sh/ruff/rules/undocumented-magic-method/ rule in our
codebase. There are many uncontroversial rules - but that one is somewhat
more controversial than others.

See https://github.com/apache/airflow/pull/37602#issuecomment-2001951402
and
https://github.com/apache/airflow/pull/38277#pullrequestreview-1945745542
for example

I think that - as even the ruff example shows - in many cases requiring the
methods to be documented will lead to rather useless documentation:

class Cat(Animal):
    def __str__(self) -> str:
        """Return a string representation of the cat."""
        return f"Cat: {self.name}"

There is IMHO very little value in having such documentation. It might be
useful in cases where we have a really good reason to add such a magic
method and it is important to document it, but in many cases the
documentation will just restate what the magic method's name already
explains well (like the case above).

This actually reminds me of the early days of Java documentation, where
javadoc looked more or less like this:

"Paints the object"
func paint()

"Repaints the object"
func repaint()

However - maybe I am wrong :). Maybe it's worth documenting those methods
in bulk, even if in many cases it will not bring much value?

WDYT? Should we mandate documenting them - or leave it up to the author to
document them in cases where it feels needed?
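
For reference, if we decide to leave it to the author, opting out is a
one-line ignore in the ruff configuration - a sketch assuming our usual
pyproject.toml-based setup:

```toml
[tool.ruff.lint]
# D105: undocumented magic method - leave magic-method docstrings to the
# author's judgement instead of enforcing them everywhere.
ignore = ["D105"]
```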

J.


Re: [VOTE] Remove experimental API

2024-03-16 Thread Jarek Potiuk
Would you still vote `-1` - that, of course, was the question.

On Sat, Mar 16, 2024 at 8:37 PM Jarek Potiuk  wrote:

> Question: Jed, Ash: Would you still vote If we move it to provider (with
> status "removed from maintenance except security fixes" - same as we did
> with daskexecutor?
>
> J.
>
> On Sat, Mar 16, 2024 at 8:25 PM Ash Berlin-Taylor  wrote:
>
>> As much as I would love to remove it I'm with Jed: if it worked on 2.0 it
>> should work on all 2.x
>>
>> My vote is -1
>>
>> On 16 March 2024 19:08:13 GMT, Jed Cunningham 
>> wrote:
>> >I forgot to add the "why" - I view this as a breaking change still.
>>
>


Re: [VOTE] Remove experimental API

2024-03-16 Thread Jarek Potiuk
Question: Jed, Ash: Would you still vote If we move it to provider (with
status "removed from maintenance except security fixes" - same as we did
with daskexecutor?

J.

On Sat, Mar 16, 2024 at 8:25 PM Ash Berlin-Taylor  wrote:

> As much as I would love to remove it I'm with Jed: if it worked on 2.0 it
> should work on all 2.x
>
> My vote is -1
>
> On 16 March 2024 19:08:13 GMT, Jed Cunningham 
> wrote:
> >I forgot to add the "why" - I view this as a breaking change still.
>


Re: [VOTE] Remove experimental API

2024-03-16 Thread Jarek Potiuk
+1 (binding)

On Sat, Mar 16, 2024 at 1:14 PM Andrey Anshin 
wrote:

> Greetings everyone!
>
> I would like to start a vote about removing Experimental API
> support in the next minor Airflow version, presumably 2.9, but it could
> be postponed to 2.10.
>
> By default the experimental REST API is turned off, and we recommend using
> the stable REST API:
>
> https://airflow.apache.org/docs/apache-airflow/stable/deprecated-rest-api-ref.html
>
> Discussion about deprecate and remove support of Experimental API started
> in 1.10
> - Dev List:
> https://lists.apache.org/thread/jdz9l7bsnsw5c3t27dxfrx5pd4wvjlxt
> - Github Issues: https://github.com/apache/airflow/issues/10552
>
> And recently:
> - https://lists.apache.org/thread/khl7gvzpcv3kn99zc441wb9m2dyz4gp9
>
> The vote will last until 13:00 GMT/UTC on March 22, 2024, and until at
> least 3 binding votes have been cast.
>
> Please vote accordingly:
>
> [ ] + 1 approve
> [ ] + 0 no opinion
> [ ] - 1 disapprove with the reason
>
> Only votes from PMC members and committers are binding, but other members
> of the community are encouraged to check the AIP and vote with
> "(non-binding)".
>


Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-03-15 Thread Jarek Potiuk
d user-friendly.
>
> *Image:* https://imgur.com/gallery/uQNqiVc (highly recommend reviewing
> the image to understand the underlying setup)
>
> *---*
>
> I’d suggest that interested Airflow users review the scenario and share
> your support or concerns on this concept in this thread or AIP. For those
> interested in diving deeper into the details, the AIP is available here -
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-tenant+deployment+of+Airflow+components
>
>
> Thanks
> Shubham
> Product Manager - Amazon MWAA
>
>
>
> *From: *Jarek Potiuk 
> *Reply-To: *"us...@airflow.apache.org" 
> *Date: *Monday, March 11, 2024 at 4:05 PM
> *To: *"dev@airflow.apache.org" , "
> us...@airflow.apache.org" 
> *Subject: *RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] DRAFT AIP-67
> Multi-tenant deployment of Airflow components
>
>
>
>
>
> I have iterated and already got a LOT of comments from a LOT of people
> (Thanks everyone who spent time on it ). I'd say the document is out of
> draft already, it very much describes the idea of multi-tenancy that I hope
> we will be voting on some time in the future.
>
>
>
> Taking into account that ~30% of people in our survey said they want
> "multi-tenancy" - what I am REALLY interested in is to get honest feedback
> about the proposal. Mainly:
>
>
>
> *"Is this the multi-tenancy you were looking for?"*
>
>
>
> Or were you looking for different droids (err, tenants) maybe?
>
>
>
> I do not want to exercise my Jedi skills to influence your opinion, that's
> why the document is there (and some people say it's nice, readable and
> pretty complete) so that you can judge yourself and give feedback.
>
>
>
> The document is here:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-tenant+deployment+of+Airflow+components
>
>
>
>
> Feel free to comment here, or in the document. I would love to hear more
> voices, and have some ideas what to do next to validate the idea, so please
> - engage for now - but also expect some follow-ups.
>
>
>
> J.
>
>
>
>
>
> On Wed, Mar 6, 2024 at 9:16 AM Jarek Potiuk  wrote:
>
> Sooo.. Seems that it's an AIP time :D I've just published a Draft of
> AIP-67:
>
> Multi-tenant deployment of Airflow components
>
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-67+Multi-tenant+deployment+of+Airflow+components
>
> This AIP  is a bit lighter in detail than the others you could see
> from Jed , Nikolas and Maciej. This is really a DRAFT / High Level
> idea of Multi-Tenancy that could be implemented as the follow-up after
> previous steps of Multi-Tenancy implemented (or being implemented)
> right now.
>
> I decided to - rather than describe all the details now -  focus on
> the concept of Multitenancy that I wanted to propose. Most of all
> explaining the concept, comparing it to current ways of achieving some
> forms of multi-tenancy and showing benefits and drawbacks of the
> solution and connected costs (i.e. what complexity we need to add to
> achieve it).
>
> When thinking about Multi-tenancy, I realized few things:
>
> * everyone might understand multi-tenancy differently
> * some forms of multi-tenancy are achievable even today
> * but - most of all - I started to question myself "Is this what we
> can do, enough for some, sufficiently numerous groups of users to call
> it a useful feature for them".
>
> So before we get into more details - my aim is to make sure we are all
> at the same page on what we CAN do as a multi-tenancy, and eventually
> to decide whether we SHOULD do it.
>
> Have fun. Bring in comments and feedback.
>
> More about all the currently active AIPs at today's Town Hall
>
> BTW. Do you think it's a surprise that 5 AIPS were announced just
> before the Town Hall? I think not  :D
>
> J.
>
>


Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-03-15 Thread Jarek Potiuk
>> -Adam
>> 
>> From: Mehta, Shubham 
>> Sent: Thursday, March 14, 2024 3:21:12 PM
>> To: us...@airflow.apache.org ;
>> dev@airflow.apache.org 
>> Subject: Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow
>> components
>>
>> Hi folks,
>>
>> Firstly, thanks Jarek for putting together such a thorough and
>> well-thought-out proposal.
>>
>> I am very much in support of the multi-tenancy proposal. Having discussed
>> this with over 30 customers (AWS and non-AWS), there's a clear desire to
>> shift focus from the complex management of multiple Airflow environments to
>> enhancing their capabilities, such as enabling data quality checks and
>> lineage. This proposal is a significant step towards achieving that goal.
>>
>> Acknowledging that not every Airflow user has enough time to thoroughly
>> review the AIP, I have drafted a user scenario that encapsulates what's
>> possible with the implementation of multi-tenancy support:
>>
>>  Scenario: Multi-Tenancy in Apache Airflow at [Rocket] 
>> [Rocket], a leading [mobile gaming platform], has adeptly structured its
>> cloud operations using Apache Airflow to provide an efficient and secure
>> multi-tenant environment for orchestrating their complex workflows. This
>> approach caters to the diverse needs of their three main user groups: the
>> Data Engineering team, the Data Science team, and the Data Analytics team.
>>
>> All teams share basic Airflow components like the Scheduler and
>> Webserver, providing centralized management with shared cost. Each team has
>> its own distinct tenant cluster, offering self-sufficiency, flexibility,
>> and isolation. The Data Engineering team builds ETL/ELT pipelines and
>> produces user profile, telemetry, and marketing data. The Analytics team
>> works with marketing data and user information to build comprehensive
>> dashboards. The Data Science team uses Kubernetes as their execution
>> environment for heavy-duty machine learning tasks, producing a churn
>> prediction dataset.
>>
>> Members of each team can only see and work with their own workflows.
>> However, Data engineers are granted access to all tenants, enabling them to
>> assist with DAG troubleshooting and optimization across all teams. Upon
>> logging in, users are presented with a tenant-specific view, displaying
>> only the relevant DAGs and artifacts. For those with multi-tenant access,
>> seamless navigation between different tenant views is available without the
>> need for re-authentication.
>>
>> This setup lets each team work independently with their own tools and
>> data, while also getting help from data engineers when needed. It's secure,
>> efficient, and user-friendly.
>>
>> Image:
>> https://imgur.com/gallery/uQNqiVc (highly recommend reviewing the
>> image to understand the underlying setup)
>>
>> ---
>>
>> I’d suggest that interested Airflow users review the scenario and share
>> your support or concerns on this concept in this thread or AIP. For those
>> interested in diving deeper into the details, the AIP is available here -
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-tenant+deployment+of+Airflow+components
>>
>> Thanks
>> Shubham
>> Product Manager - Amazon MWAA
>>
>> From: Jarek Potiuk 
>> Reply-To: "us...@airflow.apache.org" 
>> Date: Monday, March 11, 2024 at 4:05 PM
>> To: "dev@airflow.apache.org" , "
>> us...@airflow.apache.org" 
>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] DRAFT AIP-67
>> Multi-tenant deployment of 

Re: [VOTE] AIP-59 Performance testing framework

2024-03-13 Thread Jarek Potiuk
+1 (binding)

On Wed, Mar 13, 2024 at 7:40 AM Bartosz Jankiewicz
 wrote:
>
> Hi folks,
>
> The AIP for performance testing has been in review for quite some time and
> I've included your feedback in the document.
>
> I'd like to call a vote, and if you agree I'd start a development of the
> framework.
>
> The AIP can be found below:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-59+Performance+tests+framework
>
> Please vote accordingly:
>
> [ ] + 1 approve
> [ ] + 0 no opinion
> [ ] - 1 disapprove with the reason
>
> Thank you!

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org



[ANNOUNCEMENT] Updates to just released Airflow 2.8.3 constraints and images

2024-03-12 Thread Jarek Potiuk
Hello everyone,

As discussed earlier on the #release-management channel in Slack (and
I am bringing it now to the devlist) - the Airflow 2.8.3 images released
yesterday had been updated today (following changes to constraints
resulting from some early problem reports we found). In case you used
a 2.8.3 image, you should pull it again to get the latest version.

Those changes do not constitute a new software release; they merely update
the frozen "golden" set of constraints, so there was no need to vote on or
release anything.

The change affected the 2.8.3 image: it contains providers released a few
days ago and new/updated dependencies, but the most important changes
are:

* pandas in all images for all Python versions is now downgraded to < 2.2
(2.2 conflicts with sqlalchemy) - see
https://github.com/apache/airflow/pull/37748

* the smtp provider is now upgraded in constraints to 1.6.1 (1.6.0 was
yanked as it contained a breaking change) -
https://github.com/apache/airflow/pull/37701

Just as a reminder - as of Airflow 2.8.3, the image also does not have the
`gosu` binary installed by default. This binary was used in the past
by some of the scripts, but it's not used by us any more, and it added
some security vulnerabilities when the image was scanned - that's why
we removed it.

If you need it, you will need to install it on your own.

Details about the constraints changed:
https://github.com/apache/airflow/commit/89b5d00a24e8bea3203ab288bf03834564e5c51b

I've updated the documentation of the image (I cherry-picked it, and the
documentation on our website is being rebuilt):
https://github.com/apache/airflow/pull/38085 - it should be live
soon.

J.

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org



Re: [DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-03-11 Thread Jarek Potiuk
I have iterated and already got a LOT of comments from a LOT of people
(Thanks everyone who spent time on it ). I'd say the document is out of
draft already, it very much describes the idea of multi-tenancy that I hope
we will be voting on some time in the future.

Taking into account that ~30% of people in our survey said they want
"multi-tenancy" - what I am REALLY interested in is to get honest feedback
about the proposal. Mainly:

*"Is this the multi-tenancy you were looking for?"*

Or were you looking for different droids (err, tenants) maybe?

I do not want to exercise my Jedi skills to influence your opinion, that's
why the document is there (and some people say it's nice, readable and
pretty complete) so that you can judge yourself and give feedback.

The document is here:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-tenant+deployment+of+Airflow+components


Feel free to comment here, or in the document. I would love to hear more
voices, and have some ideas what to do next to validate the idea, so please
- engage for now - but also expect some follow-ups.

J.


On Wed, Mar 6, 2024 at 9:16 AM Jarek Potiuk  wrote:

> Sooo.. Seems that it's an AIP time :D I've just published a Draft of
> AIP-67:
>
> Multi-tenant deployment of Airflow components
>
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-67+Multi-tenant+deployment+of+Airflow+components
>
> This AIP  is a bit lighter in detail than the others you could see
> from Jed , Nikolas and Maciej. This is really a DRAFT / High Level
> idea of Multi-Tenancy that could be implemented as the follow-up after
> previous steps of Multi-Tenancy implemented (or being implemented)
> right now.
>
> I decided to - rather than describe all the details now -  focus on
> the concept of Multitenancy that I wanted to propose. Most of all
> explaining the concept, comparing it to current ways of achieving some
> forms of multi-tenancy and showing benefits and drawbacks of the
> solution and connected costs (i.e. what complexity we need to add to
> achieve it).
>
> When thinking about Multi-tenancy, I realized few things:
>
> * everyone might understand multi-tenancy differently
> * some forms of multi-tenancy are achievable even today
> * but - most of all - I started to question myself "Is this what we
> can do, enough for some, sufficiently numerous groups of users to call
> it a useful feature for them".
>
> So before we get into more details - my aim is to make sure we are all
> at the same page on what we CAN do as a multi-tenancy, and eventually
> to decide whether we SHOULD do it.
>
> Have fun. Bring in comments and feedback.
>
> More about all the currently active AIPs at today's Town Hall
>
> BTW. Do you think it's a surprise that 5 AIPS were announced just
> before the Town Hall? I think not  :D
>
> J.
>


Re: [PROPOSAL] Adding MSGraphSDK Async Operator to Airflow

2024-03-11 Thread Jarek Potiuk
Then - just add it as another operator in Azure :).

It's no more than yet-another-operator for an Azure service. It would be
a different story if it was a different "service" - with completely
different stakeholders and maintenance burden involved. Being "just
another Azure operator", it adds very little overhead, and generally we
do not need to discuss it - just getting a good-quality PR with tests,
docs etc. is enough.

On Mon, Mar 11, 2024 at 6:27 PM Blain David  wrote:
>
> Hello Jarek,
>
> There is no particular need to add this into a separate provider - I just did
> it that way as I wanted to deploy it myself. It could perfectly well reside
> within the Microsoft Azure provider.
>
> There is nothing special about it, except that it depends on the msgraph-core
> dependency, which facilitates the REST calls (it is light and doesn't have
> the generated client code that msgraph-sdk has, which we don't need in
> Airflow anyway).
> In my first POC it used the full msgraph-sdk dependency, which did not make
> sense unless you would use the full-blown client through the Hook in Airflow.
> But as the main focus of this provider was to avoid as much custom Python
> code in our DAGs as possible, it made no sense.
> The azure-identity dependency is also needed but that one is already present 
> in Airflow as the Microsoft Azure provider already uses it for the Fabric 
> hooks.
> So indeed, it's just another hook, triggerer and operator, with the
> particularity that it only works in async (i.e. deferred) mode.
>
> What do you think?
>
> Kind regards,
> David
>
> -Original Message-
> From: Jarek Potiuk 
> Sent: Sunday, 10 March 2024 18:19
> To: dev@airflow.apache.org
> Cc: Bienvenu Joffrey 
> Subject: Re: [PROPOSAL] Adding MSGraphSDK Async Operator to Airflow
>
>
> Question - is there any reason you want to add it as a separate provider and
> not just another operator in the azure provider? I looked at the code and
> it's not a lot, and I see no particular reason why it should not simply be
> yet another operator/hook there, just as many other operators. I was under
> the impression there was something special about the SDK you use or its
> "proprietaredness" (for lack of a better word) - but this seems like
> yet-another operator, hook and triggerer in `microsoft.azure`.
>
> Or am I missing something?
>
> J.
>
> On Fri, Mar 1, 2024 at 8:57 AM Blain David  wrote:
>
> > Hello Jarek and Elad,
> >
> > Indeed maintenance could be a concern, but I think that's already the
> > same case regarding all other Microsoft related providers in Airflow.
> > Also know that I also already contributed to the Microsoft Graph
> > Python SDK project on the Microsoft repo:
> > https://github.com/microsoftgraph/msgraph-sdk-python
> > I also already contributed to the Airflow providers fixing bugs and
> > doing enhancements, so there is a commitment on my part also 
> > The hook, operator and trigger are well tested, I always try to
> > deliver code that is tested as much as possible, we have a score of
> > 90% on our sonarqube and only a technical debt of 26 mins (which will
> > probably decrease).
> >
> > In the meantime I've been finetuning and testing the operator in our
> > environment, we now have multiple DAG's using it.
> >
> > Below are some advantages of using the MSGraphSDKOperator:
> > - handles and refreshes bearer tokens automatically thanks to the
> > azure provider classes by Microsoft, which you don't need to maintain
> > you just use it.
> > - the operator is fully async using triggerers, so requests and
> > responses are handled by the triggerer and worker slots are not
> > blocked while waiting, which is good as the Microsoft implementation
> > only allows async calls anyway.
> > - you don't need to worry about paging, this is also handled
> > automatically asynchronously for you as Microsoft is following the
> > OData spec, the operator can handle this in a generic way (
> > https://learn.microsoft.com/en-us/odata/client/pagination).
> > - uses the newer httpx library
> > (https:/

Re: Python 3.12 support is here (!)

2024-03-11 Thread Jarek Potiuk
There were strange errors in the canary build which were not showing up
before - so I need to revert and investigate.

On Mon, Mar 11, 2024 at 8:52 AM Jarek Potiuk  wrote:

> And merged :)
>
> On Sun, Mar 10, 2024 at 6:52 PM Jarek Potiuk  wrote:
>
>> Hey here,
>>
>> FINALLY - after 5 months (a little more than I initially anticipated - I
>> thought it would take 3-4 months) we can finally add Python 3.12.
>>
>> The timing is about right - it's the day we plan to cut the
>> Airflow `v2-9-test` branch !!
>>
>> Thanks to Bolke for the final push on the Universal Pathlib migration
>> (and Andreas Poehlmann who is the Universal Pathlib Maintainer). That was
>> the final blocker that kept us from adding the support for 3.12 
>>
>> The PR is all "GREEN" https://github.com/apache/airflow/pull/36755 and
>> waits for reviews :)
>>
>> ---
>>
>> A bit more details on 3.12 support:
>>
>>
>> While Airflow fully supports 3.12, we had to exclude 3 providers from
>> Python 3.12 support (temporarily - all of them will be included back when
>> they support 3.12):
>>
>> * *apache.beam* (beam has no 3.12 support yet) and looking at the state
>> of the ticket here we will wait a bit more
>> https://github.com/apache/beam/issues/29149
>>
>> * *apache.cassandra* (the default setup for cassandra does not work with
>> 3.12 and requires custom compilation) - they are working on releasing a
>> build that works (the problem is their binary driver does not have the
>> right library compiled in). Should be fixed in the next cassandra-driver
>> release (3.30.0). Either because they fix their build environment to
>> compile the libev support in
>> https://datastax-oss.atlassian.net/browse/PYTHON-1378 or when they
>> promote asyncio reactor to be "production ready":
>> https://datastax-oss.atlassian.net/browse/PYTHON-1375
>>
>> * *papermill* (not sure if they will release a new version with 3.12
>> support any time soon). The fix is already merged
>> https://github.com/nteract/papermill/pull/771 - but the last release of
>> papermill happened in November, and there is not much activity in the
>> project.
>>
>> All the other providers seem to happily work in the Python 3.12
>> environment.
>>
>> J.
>>
>>


Re: Python 3.12 support is here (!)

2024-03-11 Thread Jarek Potiuk
And merged :)

On Sun, Mar 10, 2024 at 6:52 PM Jarek Potiuk  wrote:

> Hey here,
>
> FINALLY - after 5 months (a little more than I initially anticipated - I
> thought it would take 3-4 months) we can finally add Python 3.12.
>
> The timing is about right - it's the day we plan to cut the
> Airflow `v2-9-test` branch !!
>
> Thanks to Bolke for the final push on the Universal Pathlib migration (and
> Andreas Poehlmann who is the Universal Pathlib Maintainer). That was the
> final blocker that kept us from adding the support for 3.12 
>
> The PR is all "GREEN" https://github.com/apache/airflow/pull/36755 and
> waits for reviews :)
>
> ---
>
> A bit more details on 3.12 support:
>
>
> While Airflow fully supports 3.12, we had to exclude 3 providers from
> Python 3.12 support (temporarily - all of them will be included back when
> they support 3.12):
>
> * *apache.beam* (beam has no 3.12 support yet) and looking at the state
> of the ticket here we will wait a bit more
> https://github.com/apache/beam/issues/29149
>
> * *apache.cassandra* (the default setup for cassandra does not work with
> 3.12 and requires custom compilation) - they are working on releasing a
> build that works (the problem is their binary driver does not have the
> right library compiled in). Should be fixed in the next cassandra-driver
> release (3.30.0). Either because they fix their build environment to
> compile the libev support in
> https://datastax-oss.atlassian.net/browse/PYTHON-1378 or when they
> promote asyncio reactor to be "production ready":
> https://datastax-oss.atlassian.net/browse/PYTHON-1375
>
> * *papermill* (not sure if they will release a new version with 3.12
> support any time soon). The fix is already merged
> https://github.com/nteract/papermill/pull/771 - but the last release of
> papermill happened in November, and there is not much activity in the
> project.
>
> All the other providers seem to happily work in the Python 3.12
> environment.
>
> J.
>
>


Python 3.12 support is here (!)

2024-03-10 Thread Jarek Potiuk
Hey here,

FINALLY - after 5 months (a little more than I initially anticipated - I
thought it would take 3-4 months) we can finally add Python 3.12.

The timing is about right - it's the day we plan to cut the Airflow `v2-9-test`
branch !!

Thanks to Bolke for the final push on the Universal Pathlib migration (and
Andreas Poehlmann who is the Universal Pathlib Maintainer). That was the
final blocker that kept us from adding the support for 3.12 

The PR is all "GREEN" https://github.com/apache/airflow/pull/36755 and
waits for reviews :)

---

A bit more details on 3.12 support:


While Airflow fully supports 3.12, we had to exclude 3 providers from
Python 3.12 support (temporarily - all of them will be included back when
they support 3.12):

* *apache.beam* (beam has no 3.12 support yet) and looking at the state of
the ticket here we will wait a bit more
https://github.com/apache/beam/issues/29149

* *apache.cassandra* (the default setup for cassandra does not work with
3.12 and requires custom compilation) - they are working on releasing a
build that works (the problem is their binary driver does not have the
right library compiled in). Should be fixed in the next cassandra-driver
release (3.30.0). Either because they fix their build environment to
compile the libev support in
https://datastax-oss.atlassian.net/browse/PYTHON-1378 or when they promote
asyncio reactor to be "production ready":
https://datastax-oss.atlassian.net/browse/PYTHON-1375

* *papermill* (not sure if they will release a new version with 3.12
support any time soon). The fix is already merged
https://github.com/nteract/papermill/pull/771 - but the last release of
papermill happened in November, and there is not much activity in the
project.

All the other providers seem to happily work in the Python 3.12 environment.

J.


Re: [DISCUSS] Common.util provider?

2024-03-10 Thread Jarek Potiuk
Coming back to it - what about the "polyfill" :)? What's different vs the
"common.sql" way of doing it? How do we think it can work?

On Thu, Feb 22, 2024 at 1:58 PM Jarek Potiuk  wrote:

> > The symbolic link approach seems to disregard all the external
> providers, unless I misunderstand it.
>
> Not really. It just does not make it easy for the external providers to
> use it "fast".  They can still - if they want - just manually copy those
> utils from the latest version of Airflow if they want to use it. Almost by
> definition, those will be small, independent modules that can be simply
> copied as needed by whoever releases external providers - and they are also
> free to copy any older version if they want. That is a nice feature that
> makes them fully decoupled from the version of Airflow they are installed
> in (same as community providers). Or - if they want they can just import
> them from "airflow.provider_utils" - but then they have to add >= Airflow
> 2.9 if that util appeared in Airflow 2.9 (which is the main reason we want
> to use symbolic links - because due to our policies and promises, we do not
> want community providers to depend on the latest version of Airflow in the
> vast majority of cases).
>
> So this approach is also fully usable by external providers, but it
> requires some manual effort to copy the modules to their providers.
>
> > I like the polyfill idea. A backport provider that brings new interfaces
> to providers without the actual functionalities.
>
> I would love to hear more about this, I think the "common.util" was
> exactly the kind of polyfill approach (with its own versioning
> complexities) but maybe I do not understand how such a polyfill provider
> would work. Say we want to add a new "urlparse" method usable for all
> providers. Could you explain how it would work - say:
>
> * we add "urlparse" in Airflow 2.9
> * some provider wants to use it in Airflow 2.7
>
> What providers, with what code/interfaces we would have to release in this
> case and what dependencies such providers that want to use it (both
> community and Airflow should have)? I **think** that would mean exactly the
> "common." approach we already have with "io" and "sql", but
> maybe I do not understand it :)
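
The try/except "polyfill" import pattern discussed in this thread can be sketched as follows. All module paths here are hypothetical illustrations, not real Airflow APIs - the point is only the fallback mechanism an external provider could use:

```python
# Hypothetical sketch of the "polyfill" import pattern: try the util from
# Airflow core first (available only in newer Airflow versions), and fall
# back to a backport shipped in a separate "common" package otherwise.
# "airflow.provider_utils.uri" is an illustrative, non-existent module.

def get_urlparse():
    try:
        # Would only exist on a (hypothetical) newer Airflow version
        from airflow.provider_utils.uri import urlparse  # type: ignore
    except ImportError:
        # Fallback: here we use the stdlib as a stand-in for the backport
        from urllib.parse import urlparse
    return urlparse

urlparse = get_urlparse()
print(urlparse("s3://bucket/key").scheme)
```

This way the provider works on older Airflow versions without pinning `apache-airflow >= 2.9`, which is the dependency constraint the thread is trying to avoid.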
>
> On Thu, Feb 22, 2024 at 1:45 PM Tzu-ping Chung 
> wrote:
>
>> I like the polyfill idea. A backport provider that brings new interfaces
>> to providers without the actual functionalities.
>>
>>
>> > On 22 Feb 2024, at 20:41, Maciej Obuchowski 
>> wrote:
>> >
>> >> That's why I generally do
>> > not like the "util" approach because common packaging introduces
>> > unnecessary coupling (you have to upgrade independent utils together).
>> >
>> > From my experience with releasing OpenLineage where we do things
>> similarly:
>> > I think that's the advantage of it, but only _if_ you can release those
>> > together.
>> > With "build-in" providers it makes sense, but could be burdensome if
>> > "external"
>> > ones would depend on that functionality.
>> >
>> >> I know it's not been the original idea behind providers, but - after
>> > testing common.sql and now also having common.io, seems like more and
>> more
>> > we would like to extract some common code that we would like providers
>> to
>> > use, but we refrain from it, because it will only be actually usable 6
>> > months after we introduce some common code.
>> >
>> > So, maybe a better approach would be to introduce the functionality into
>> > core,
>> > and use common.X provider as "polyfill" (to borrow JS nomenclature)
>> > to make sure providers could use that functionality now, where external
>> > ones could depend on the Airflow ones?
>> >
>> > The symbolic link approach seems to disregard all the external
>> providers,
>> > unless
>> > I misunderstand it.
>> >
>> > czw., 22 lut 2024 o 13:28 Jarek Potiuk  napisał(a):
>> >
>> >>> Ideally utilities for each purpose (parsing URI, reading Object
>> Storage,
>> >> reading SQL, etc.) should each have its own utility package, so they
>> can be
>> >> released independently without dependency problems popping up if we
>> need to
>> >> break compatibility to one purpose. But more providers are
>> exponentially
>> >> more difficult to maintain, so I’d settle for one utility provider for
>> now
>&

Re: [PROPOSAL] Adding MSGraphSDK Async Operator to Airflow

2024-03-10 Thread Jarek Potiuk
Question - is there any reason you want to add it as a separate provider
and not just another operator in the azure provider ? I looked at the code
and it's not a lot, and I see no particular reason why it should not be
simply yet another operator/Hook there. Just as many other operators - I
was under the impression there is something special about the SDK you use
or "proprietaredness" (for lack of a better word) - but that seems like
yet-another operator, hook, triggerer in `microsoft.azure`.

Or am I missing something?

J.

On Fri, Mar 1, 2024 at 8:57 AM Blain David  wrote:

> Hello Jarek and Elad,
>
> Indeed maintenance could be a concern, but I think that's already the same
> case regarding all other Microsoft related providers in Airflow.
> Also know that I also already contributed to the Microsoft Graph Python
> SDK project on the Microsoft repo:
> https://github.com/microsoftgraph/msgraph-sdk-python
> I also already contributed to the Airflow providers fixing bugs and doing
> enhancements, so there is a commitment on my part also 
> The hook, operator and trigger are well tested, I always try to deliver
> code that is tested as much as possible, we have a score of 90% on our
> sonarqube and only a technical debt of 26 mins (which will probably
> decrease).
>
> In the meantime I've been finetuning and testing the operator in our
> environment, we now have multiple DAG's using it.
>
> Below are some advantages of using the MSGraphSDKOperator:
> - handles and refreshes bearer tokens automatically thanks to the azure
> provider classes by Microsoft, which you don't need to maintain you just
> use it.
> - the operator is fully async using triggerers, so requests and responses
> are handled by the triggerer and worker slots are not blocked while
> waiting, which is good as the Microsoft implementation only allows async
> calls anyway.
> - you don't need to worry about paging, this is also handled automatically
> asynchronously for you as Microsoft is following the OData spec, the
> operator can handle this in a generic way (
> https://learn.microsoft.com/en-us/odata/client/pagination).
> - uses the newer httpx library (https://github.com/encode/httpx), which
> in our case as we are behind a corporate proxy, solves a lot of proxy
> related connection issues which we encounter with the requests library but
> don't when using httpx.
> - only depends on the msgraph-core and the kiota_abstractions library from
> Microsoft (https://github.com/microsoft/kiota-abstractions-python), which
> are the foundation libs on which their full msgraph_sdk dependency is
> build, which we don't need as this is only useful when you want to use
> their generated Python client, which doesn't make sense in Airflow.
> - as it's Microsoft, it's also compatible with calling the PowerBI API; we
> already have multiple DAGs using this operator to call dataset refreshes on
> the PowerBI REST API.
> - probably the Intune API will also work with this operator, I'm going to
> test this soon once we are migrating all our intune related custom jobs to
> Airflow DAG's.
> - another advantage is that all parameters can be defined in a Http
> Airflow connection, independently if you want to interact with MS Graph or
> PowerBI.
> - as mentioned before, the hook, operator and trigger are well tested, we
> have a score of 90% on our sonar and only a technical debt of 26 mins.
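
The OData-style paging mentioned above can be handled generically. This is a minimal sketch - not the provider's actual code - of the loop that follows `@odata.nextLink` until the server stops returning one, accumulating the `value` items of each page:

```python
# Minimal sketch of generic OData pagination (not the provider's code):
# each response page carries its items under "value" and, if there are
# more pages, the URL of the next page under "@odata.nextLink".

def fetch_all_pages(fetch_json, first_url):
    """fetch_json is any callable mapping a URL to a decoded JSON dict."""
    items, url = [], first_url
    while url:
        page = fetch_json(url)
        items.extend(page.get("value", []))
        # None when the server has no further pages, which ends the loop
        url = page.get("@odata.nextLink")
    return items
```

In the real operator `fetch_json` would be an async HTTP call (e.g. via httpx), but the control flow is the same; passing the fetcher in as a callable also makes the pagination logic trivially testable with fake pages.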
>
> The operator could be beneficial for everyone, as it avoids needing custom
> code to achieve the same using the regular HttpOperator, which I think
> could also need refactoring and maybe should be migrated to the httpx
> library.
> I'm willing to contribute there also; maybe we can come to a point where
> the MSGraphSDKOperator could be based on the refactored HttpOperator. In
> the meantime it would be nice if the operator were already available in
> Airflow.
>
> What do you guys think?
>
> Kind regards,
> David
>
>
> -Original Message-
> From: Jarek Potiuk 
> Sent: Friday, 26 January 2024 17:25
> To: dev@airflow.apache.org
> Subject: Re: [PROPOSAL] Adding MSGraphSDK Async Operator to Airflow
>
>
>
> Hey David - any comments on that ?
>
> On Mon, Jan 15, 2024 at 1:07 PM Elad Kalif  wrote:
>
> > Hi David,
> >
> > Thanks for raising this discussion.
> > following the protocol established about accepting new provid

Re: [DISCUSS] DYNAMIC DAG RUNS WITH TASK PRIORITY

2024-03-10 Thread Jarek Potiuk
Also - not sure if you are subscribed to the devlist - so I will add your
direct address here so that you can for sure see the answer (and if you are
not subscribed, then by all means - do subscribe).

On Sun, Mar 10, 2024 at 12:01 PM Jarek Potiuk  wrote:

> I think before writing AIP in confluence, I would encourage you to try to
> describe your idea in a shared google docs document and explain it. But
> before you do that - I'd encourage you to take a close look and deep dive
> into implementation of priorities. It might be different than you think, it
> has priority weight algorithms that allow for inclusion of
> downstream/upstream task priorities. Also, because of the way Airflow
> serializes the tasks, they are re-read and refreshed every 30
> seconds by default, so whatever priority_weights you set in DAGs will
> override the priorities that you **might** want to set via an external API.
>
> Note that even today - tasks do not have "priorities" per se. They have
> "priority_weight" and "weight_rule"  - that is used to automatically
> determine what's the actual priority of the task based on those rules. So
> there is not a single "priority" you can override, there is a set of
> database queries to calculate those when tasks are eligible for execution,
> and you cannot simply "set" priority for the task this way.
>
> But there is a more fundamental problem with the proposal - it
> seems to violate a basic principle that we have in Airflow - that tasks
> and their behaviour is entirely defined by DAG authors who have access to
> the DAG folder and can change the task definition. See
> https://airflow.apache.org/docs/apache-airflow/stable/security/security_model.html
> - UI users (so also API users) - by definition cannot CHANGE DAG and task
> definitions. This is by design. They can run/rerun/clear tasks defined by
> DAG Authors - and it's the DAG authors that have ultimate influence on the
> definition of tasks. If you look very closely at the API, you will find
> that there is not a single API there that allows you to modify existing
> task definitions. Not a single one.
>
> The changes ALWAYS come from the DAG folder. No exceptions.
>
> So what you are proposing here is way more than just changing "a priority"
> of the task - you are proposing change in a fundamental assumption that
> Airflow takes - that authors of DAG are the only ones who can change it.
> Now - someone else will be able to change the task definition. Someone who
> is not a DAG author. And can change it independently from task definition.
>
> And it has far reaching consequences. For example we are just discussing a
> whole series of changes about dag Versioning:
>
> *
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-63%3A+DAG+Versioning
> *
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-64%3A+Keep+TaskInstance+try+history
> *
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-66%3A+Execution+of+specific+DAG+versions
>
> All those series of changes build on the assumption that task definition
> comes via changes in the DAG folder. They will not handle the case where task
> definition can also be changed - independently - via other mechanisms, such
> as an API call. Breaking the assumption will make the whole versioning way
> harder, or maybe even impossible.
>
> So I am not sure that the way you described your proposal is correct and
> implementation of it should not do what you propose. Maybe you should
> consider a different approach if you would like to change priorities of
> tasks (note that priorities of tasks are used when the executor decides which
> of the tasks eligible for execution should be turned from queued to
> running and should be given an execution slot).
>
> I think you have two ways, if you want to proceed with your idea:
>
> 1) Implement it by using the fact that Airflow DAGs are Python code. If
> you **really** want to permanently change priorities of tasks, you could
> simply write your DAGs in a specific way to use some variables (for example
> coming from local json file) as priorities and read it from there - and
> then, rather than making an API call to airflow webserver, you could change
> the priorities directly by changing priorities stored in those JSON files
> in the DAG folders. You could also directly modify priorities in the Python
> code as well - that's a bit more complex, but should also be possible. This
> is simple. Does not require to implement new features in Airflow, does not
> interfere with Airflow's security model and basic assumptions we have for
> DAG definition, does not have long-term effect on things like DAG
> versioning.
>
> 2) Maybe what you are a

Re: [DISCUSS] DYNAMIC DAG RUNS WITH TASK PRIORITY

2024-03-10 Thread Jarek Potiuk
I think before writing AIP in confluence, I would encourage you to try to
describe your idea in a shared google docs document and explain it. But
before you do that - I'd encourage you to take a close look and deep dive
into implementation of priorities. It might be different than you think, it
has priority weight algorithms that allow for inclusion of
downstream/upstream task priorities. Also, because of the way Airflow
serializes the tasks, they are re-read and refreshed every 30
seconds by default, so whatever priority_weights you set in DAGs will
override the priorities that you **might** want to set via an external API.

Note that even today - tasks do not have "priorities" per se. They have
"priority_weight" and "weight_rule"  - that is used to automatically
determine what's the actual priority of the task based on those rules. So
there is not a single "priority" you can override, there is a set of
database queries to calculate those when tasks are eligible for execution,
and you cannot simply "set" priority for the task this way.
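
To make the "no single priority" point concrete, here is a deliberately simplified model - not Airflow's actual implementation - of how a weight rule turns per-task `priority_weight` values into the effective priority used at scheduling time:

```python
# Simplified model (NOT Airflow's real code) of weight rules:
#   downstream: own weight + sum of weights of all downstream tasks
#   upstream:   own weight + sum of weights of all upstream tasks
#   absolute:   own weight only

def effective_priority(task, weights, downstream, upstream, rule):
    if rule == "downstream":
        return weights[task] + sum(weights[t] for t in downstream[task])
    if rule == "upstream":
        return weights[task] + sum(weights[t] for t in upstream[task])
    return weights[task]  # "absolute"

# Tiny linear DAG a >> b >> c, each task with priority_weight=1
weights = {"a": 1, "b": 1, "c": 1}
downstream = {"a": ["b", "c"], "b": ["c"], "c": []}
upstream = {"a": [], "b": ["a"], "c": ["a", "b"]}
print(effective_priority("a", weights, downstream, upstream, "downstream"))  # 3
```

Because the effective value depends on the weights of *other* tasks, "setting the priority of one task" via an API would not be a single field update - which is the point made above.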

But there is a more fundamental problem with the proposal - it
seems to violate a basic principle that we have in Airflow - that tasks
and their behaviour is entirely defined by DAG authors who have access to
the DAG folder and can change the task definition. See
https://airflow.apache.org/docs/apache-airflow/stable/security/security_model.html
- UI users (so also API users) - by definition cannot CHANGE DAG and task
definitions. This is by design. They can run/rerun/clear tasks defined by
DAG Authors - and it's the DAG authors that have ultimate influence on the
definition of tasks. If you look very closely at the API, you will find
that there is not a single API there that allows you to modify existing
task definitions. Not a single one.

The changes ALWAYS come from the DAG folder. No exceptions.

So what you are proposing here is way more than just changing "a priority"
of the task - you are proposing change in a fundamental assumption that
Airflow takes - that authors of DAG are the only ones who can change it.
Now - someone else will be able to change the task definition. Someone who
is not a DAG author. And can change it independently from task definition.

And it has far reaching consequences. For example we are just discussing a
whole series of changes about dag Versioning:

*
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-63%3A+DAG+Versioning
*
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-64%3A+Keep+TaskInstance+try+history
*
https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-66%3A+Execution+of+specific+DAG+versions

All those series of changes build on the assumption that task definition
comes via changes in the DAG folder. They will not handle the case where task
definition can also be changed - independently - via other mechanisms, such
as an API call. Breaking the assumption will make the whole versioning way
harder, or maybe even impossible.

So I am not sure that the way you described your proposal is correct and
implementation of it should not do what you propose. Maybe you should
consider a different approach if you would like to change priorities of
tasks (note that priorities of tasks are used when the executor decides which
of the tasks eligible for execution should be turned from queued to
running and should be given an execution slot).

I think you have two ways, if you want to proceed with your idea:

1) Implement it by using the fact that Airflow DAGs are Python code. If you
**really** want to permanently change priorities of tasks, you could simply
write your DAGs in a specific way to use some variables (for example coming
from local json file) as priorities and read it from there - and then,
rather than making an API call to airflow webserver, you could change the
priorities directly by changing priorities stored in those JSON files in
the DAG folders. You could also directly modify priorities in the Python
code as well - that's a bit more complex, but should also be possible. This
is simple. Does not require to implement new features in Airflow, does not
interfere with Airflow's security model and basic assumptions we have for
DAG definition, does not have long-term effect on things like DAG
versioning.
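
Option (1) above - priorities kept in a JSON file in the DAG folder and read at parse time - can be sketched as follows. File names and task ids are illustrative; in a real DAG file the returned value would be passed as the operator's `priority_weight`:

```python
# Sketch of option (1): priorities live in a JSON file next to the DAG.
# An external tool "changes task priority" simply by rewriting this file;
# the scheduler picks the new values up on the next DAG re-parse.
import json
import os
import tempfile

# Simulate a priorities file living in the DAG folder
path = os.path.join(tempfile.mkdtemp(), "priorities.json")
with open(path, "w") as f:
    f.write('{"extract": 5, "load": 1}')

def task_priority(task_id, default=1):
    """Read the priority for a task id, falling back to a default."""
    with open(path) as f:
        return json.load(f).get(task_id, default)

print(task_priority("extract"))  # 5
```

This keeps the security model intact: the change still flows through the DAG folder, which only DAG authors (or tooling they trust) can write to.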

2) Maybe what you are after is to add a completely different mechanism to
decide on priorities - currently this mechanism uses priority_weights
stored by task and priority weight rules defined in DAG/task definition and
uses it to calculate the actual priority used when the executor decides
which tasks should be picked for execution.  This would be a completely new
feature that would have to be carefully designed and implemented - also
including the fact that we are just in the middle of implementing
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-61+Hybrid+Execution.
- hybrid executors and also we are discussing multi-tenancy proposal that
builds on the way how hybrid executors will be working

Re: Bad mixing of decorated and classic operators (users shooting themselves in their foot)

2024-03-09 Thread Jarek Potiuk
I see that we have already (thanks David!) a PR:
https://github.com/apache/airflow/pull/37937 to forbid this use (which is
cool and I am glad my discussion had some ripple effect :D ).

I am quite happy to get this one merged once it passes tests/reviews, but I
would still want to explore future departure / options we might have, maybe
there will be another - long term - ripple effect :). I thought a bit more
about  - possibly - different reasons why this pattern we observe is
emerging and I have a theory.

To Andrey's comments:

> I can't see which problem is solved by allowing running one operator
inside another.

For me, the main problem to solve is that using Hooks in the way I
described in
https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2
in 2022 is almost non-discoverable by a significant percentage of users.
Especially those kinds of users that mostly treat Airflow Operators as
black-box and **just** discovered task flow as a way that they can do
simple things in Python - but they are not into writing their own custom
operators, nor look at the operator's code. Generally they don't really see
DAG authoring as writing Python code; it's mostly about using a slightly
weird DSL to build their DAGs. They mostly copy constructs that
look like putting together existing building blocks and use patterns like
`>>` to add dependencies.

Yes I try to be empathetic and try to guess how such users think about DAG
authoring - I might be wrong, but this is what I see as a recurring pattern.

So in this context - @task is not Python code writing, it's yet another DSL
that people see as appealing. And the case (Here I just speculate - so I
might be entirely wrong) I **think** the original pattern I posted above
solve is that people think that they can slightly improve the flexibility
of the operators by adding a bit of simple code before when they need a bit
more flexibility and JINJA is not enough. Basically replacing this

operator = AnOperator(with_param='{{ here I want some dynamicness }}')

with:

@task
def my_task():
    # do something more complex that is difficult to do with a JINJA expression
    calculated_param = calculate_the_param()
    operator = AnOperator(with_param=calculated_param)
    operator.execute()

And I **think** the main issue to solve here is how to make it a bit more
flexible to get parameters of operators pre-calculated **just** before the
execute() method is called.

This is speculation of course - and there might be different motivations -
but I think addressing this need better - might be actually solving the
problem (combined with David's PR). What if we found a way to pass more complex
calculations to parameters of operators?

So MAYBE (just maybe) we could do something like that (conceptual - name
might be different)


operator=AnOperator(with_param=RunThisBeforeExecute(callable=calculate_the_param))

And let the user use a callable there:

def calculate_the_param(context: dict) -> Any

I **think** we could extend our "rendering JINJA template" step to handle this
special case for templated parameters. Plus, it would nicely solve the
"render_as_native" problem - because that method could return the expected
object rather than a string (and every parameter could have its own method).

Maybe that would be a good solution ?

J.
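
The `RunThisBeforeExecute` idea sketched in the message above could look roughly like this. This is NOT an existing Airflow API - the class and function names are taken from the proposal itself, and the "rendering step" is a stand-in for Airflow's template rendering:

```python
# Conceptual sketch of the proposal: a wrapper that defers computing an
# operator parameter until render time, and can return a native object
# (not just a string, sidestepping the "render_as_native" problem).

class RunThisBeforeExecute:
    def __init__(self, fn):
        self.fn = fn

    def resolve(self, context):
        # Would be called by a (hypothetical) extended rendering step
        return self.fn(context)

def render_params(params, context):
    """Stand-in for the extended rendering step: resolve lazy params."""
    return {
        k: v.resolve(context) if isinstance(v, RunThisBeforeExecute) else v
        for k, v in params.items()
    }

def calculate_the_param(context):
    # Returns a native dict, not a string
    return {"run_id": context["run_id"], "retries": 3}

params = {"with_param": RunThisBeforeExecute(calculate_the_param)}
print(render_params(params, {"run_id": "manual__2024"}))
```

The user-facing shape matches the email's example - `AnOperator(with_param=RunThisBeforeExecute(calculate_the_param))` - while each parameter gets its own callable that receives the task context.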






On Sun, Mar 3, 2024 at 12:03 AM Daniel Standish
 wrote:

> One wrinkle to the have cake and eat it too approach is deferrable
> operators. It doesn't seem it would be very practical to resume back into
> the operator that is nested inside a taskflow function.  One solution would
> be to run the trigger in process like we currently do with `dag.test()`.
> That would make it non-deferrable in effect.  But at least it would run
> properly.  There may be other better solutions.
>


Re: [DISCUSS] Deprecation policy for apache-airflow-providers-google package

2024-03-09 Thread Jarek Potiuk
Just one comment. It might be nit-picking and a question of wording. I
don't think we can say that `we guarantee` something. The whole ASF
license is very explicit that we don't guarantee basically anything
and users are solely responsible for assessing their risks and we take
no responsibility whatsoever for anything. See section 7. of
https://www.apache.org/licenses/LICENSE-2.0

We can at most (as a PMC) express our intentions to do something.
There are legitimate cases where our intentions cannot be kept and we
will have to make tough decisions, bite the bullet and release
breaking changes. I don't think we can make `absolute` promises. But
we can do some safeguarding here, for example we could add a need to
cast a vote or run `lazy consensus`. That would be pretty acceptable
and the safeguard will be "serious" enough to make sure that yes,
indeed we are going to keep to our intentions.

But again - this is nit-picking for the wording possibly.

Just one side note - in case the reason for breaking compatibility
is a critical security issue, the vote will have
to be held on private@a.o with PMCs only, but this is also quite
acceptable - security is one of the very few exceptions where votes
(including voting on a release) can be held on private@a.a.o and the results
announced after the release.


J.

On Fri, Mar 8, 2024 at 11:58 PM Michał Modras
 wrote:
>
> To Jarek's phrasing - I think it is mostly right. The way I think about it:
> when deprecating we guarantee that for 6 months we will not do breaking
> changes, but after that period, on an occasion of releasing a major
> provider package version (which is probably breaking anyway) we can do
> cleanup and remove the old deprecated code.
>
> > Elad - we are planning to do the cleanups, hence we wanted to put the
> > policy in place for the Google provider, so we constrain ourselves and our
> > users know what to expect, as opposed to doing things ad hoc and at will.
> In this sense the policy is redundant - we could not surface it, and
> internally follow these rules to the same effect, but we prefer to be
> transparent and inform the users of how the deprecation process will look.
> > The goal of the policy is to inform as opposed to enforce anything stricter
> > than what we would like to do anyway.
>
> On Wed, Mar 6, 2024 at 5:14 AM Jarek Potiuk  wrote:
>
> > Hey Elad. I think (and let me repeat that) - nobody suggests that we
> > ALWAYS remove some deprecations after 6 months. The idea  - as I see
> > it - is completely reversed. As I understand current idea is that we
> > say "By default we do not introduce any breaking changes (i.e. we do
> > not remove deprecations) for 6 months from the moment we introduced
> > last breaking changes. This means that we will have (again by
> > default) at most two breaking changes in each provider per year. Of
> > course this is just "default" and we can do deprecations more - or
> > less - often if we see we want to make exception, but we need to have
> > a very good reason for it. But I do not think anyone is putting an
> > expiry date and will "force" removal of deprecations at specific date.
> > Introducing breaking change in provider means that there are SOME
> > deprecations removed - maybe all, maybe just a few. I think the clue
> > of this change is to introduce more-or-less stable cadence "generally
> > speaking we make a breaking change in provider not more often than
> > every 6 months".
> >
> > It's very similar to our rule that we release providers more-or-less
> > every 2 weeks. Sometimes we do it less often, sometimes we don't
> > release provider even if we could (for example we did that with FAB)
> > and sometimes we do ad-hoc release because we see a need for it.
> >
> > Is this correct rephrasing of the proposal - Eugen and Michał? Do I
> > understand it correctly?
> >
> > J.
> >
> > On Tue, Mar 5, 2024 at 2:23 PM Elad Kalif  wrote:
> > >
> > > Eugen, about your two items:
> > > - What should be used instead
> > > - The date after which the method/parameter/operator will be removed
> > >
> > >
> > > The first is already included today in all deprecation warnings.
> > > The second is a big -1 from me.  We normally say that it will be removed
> > in
> > > the next major release. Sometimes we enforce it and sometimes we don't.
> > > That is our choice.
> > > I said it before and I will say it again. Time based deprecation is not
> > the
> > > way to do it. Sometimes the author of a PR and the reviewers are not
> > fully
> > > aware of the blast radius of a specific deprecation.

Re: [VOTE] Release Airflow 2.8.3 from 2.8.3rc1

2024-03-07 Thread Jarek Potiuk
+1 (binding): checked reproducibility, sources, licences, checksums,
run a few dags, tested all my changes, all looks good.

One caveat. I think we still have the change that makes the default
image `airflow/2.8.3rc1` point to `airflow/2.8.3rc1-python3.11`
instead of `airflow/2.8.3rc1-python3.8`. This is an aftermath of
moving the "default image" change to the 2.9.* line after we found the
bug in 2.8.0 that prevented the change from going live in 2.8 (but
then the fix crept into the 2.8 branch).

As with 2.8.2 last week, I fixed it manually with `docker
login` followed by:

regctl image copy --force-recursive --digest-tags
apache/airflow:2.8.3rc1-python3.8 apache/airflow:2.8.3rc1
regctl image copy --force-recursive --digest-tags
apache/airflow:slim-2.8.3rc1-python3.8 apache/airflow:slim-2.8.3rc1

I think it's not worth cancelling the release and fixing it in the
code, we can "fix" the final images in the same way after they are
pushed. It takes literally a few seconds, we just have to remember to
do it before the announcement.

J.

On Thu, Mar 7, 2024 at 11:13 PM Ephraim Anierobi
 wrote:
>
> Hey fellow Airflowers,
>
> I have cut Airflow 2.8.3rc1. This email is calling a vote on the release,
> which will last at least 72 hours, from Thursday, March 7, 2024 at 10:10 pm
> UTC
> until Sunday, March 10, 2024, at 10:10 pm UTC,
> and until 3 binding +1 votes have been received.
>
>
> The status of testing of the release is kept at
> https://github.com/apache/airflow/issues/37982
>
> Consider this my (binding) +1.
>
> Airflow 2.8.3rc1 is available at:
> https://dist.apache.org/repos/dist/dev/airflow/2.8.3rc1/
>
> *apache-airflow-2.8.3-source.tar.gz* is a source release that comes with
> INSTALL instructions.
> *apache-airflow-2.8.3.tar.gz* is the binary Python "sdist" release.
> *apache_airflow-2.8.3-py3-none-any.whl* is the binary Python wheel "binary"
> release.
>
> Public keys are available at:
> https://dist.apache.org/repos/dist/release/airflow/KEYS
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but all members of the community
> are encouraged to test the release and vote with "(non-binding)".
>
> The test procedure for PMC members is described in:
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-pmc-members
>
> The test procedure for Contributors who would like to test this RC is
> described in:
> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md\#verify-the-release-candidate-by-contributors
>
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 2.8.3. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release.
>
> Release Notes:
> https://github.com/apache/airflow/blob/2.8.3rc1/RELEASE_NOTES.rst
>
> For information on what goes into a release please see:
> https://github.com/apache/airflow/blob/main/dev/WHAT_GOES_INTO_THE_NEXT_RELEASE.md
>
> Changes since 2.8.2:
>
> *Significant Changes*
>
> The smtp provider is now pre-installed when you install Airflow. (#37713)
>
> *Bug Fixes*
> - Add "MENU" permission in auth manager (#37881)
> - Fix external_executor_id being overwritten (#37784)
> - Make more MappedOperator members modifiable (#37828)
> - Set parsing context dag_id in dag test command (#37606)
>
> *Miscellaneous*
> - Remove useless methods from security manager (#37889)
> - Improve code coverage for TriggerRuleDep (#37680)
> - The SMTP provider is now preinstalled when installing Airflow (#37713)
> - Bump min versions of openapi validators (#37691)
> - Properly include ``airflow_pre_installed_providers.txt`` artifact (#37679)
>
> *Doc Only Changes*
> - Clarify lack of sync between workers and scheduler (#37913)
> - Simplify some docs around airflow_local_settings (#37835)
> - Add section about local settings configuration (#37829)
> - Fix docs of ``BranchDayOfWeekOperator`` (#37813)
> - Write to secrets store is not supported by design (#37814)
> - ``ERD`` generating doc improvement (#37808)
> - Update incorrect config value (#37706)
> - Update security model to clarify Connection Editing user's capabilities
> (#37688)
> - Fix ImportError on examples dags (#37571)
>
> Cheers,
> Ephraim

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org



[DISCUSS] DRAFT AIP-67 Multi-tenant deployment of Airflow components

2024-03-06 Thread Jarek Potiuk
Sooo.. Seems that it's AIP time :D I've just published a draft of AIP-67:

Multi-tenant deployment of Airflow components

https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-67+Multi-tenant+deployment+of+Airflow+components

This AIP is a bit lighter in detail than the others you could see
from Jed, Nikolas and Maciej. This is really a DRAFT / high-level
idea of Multi-Tenancy that could be implemented as the follow-up after
previous steps of Multi-Tenancy implemented (or being implemented)
right now.

I decided to - rather than describe all the details now - focus on
the concept of Multi-tenancy that I wanted to propose. Most of all,
explaining the concept, comparing it to current ways of achieving some
forms of multi-tenancy and showing benefits and drawbacks of the
solution and connected costs (i.e. what complexity we need to add to
achieve it).

When thinking about Multi-tenancy, I realized a few things:

* everyone might understand multi-tenancy differently
* some forms of multi-tenancy are achievable even today
* but - most of all - I started to question myself: "Is what we
can do enough for a sufficiently numerous group of users to call it
a useful feature for them?"

So before we get into more details - my aim is to make sure we are all
on the same page about what we CAN do as multi-tenancy, and eventually
to decide whether we SHOULD do it.

Have fun. Bring in comments and feedback.

More about all the currently active AIPs at today's Town Hall

BTW. Do you think it's a surprise that 5 AIPs were announced just
before the Town Hall? I think not :D

J.

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org



Re: [VOTE] Airflow Providers prepared on March 04, 2024

2024-03-05 Thread Jarek Potiuk
+1 (binding): tested /checked my changes. Checked reproducibility,
licences, signatures, checksums -  all looks good.

On Tue, Mar 5, 2024 at 5:51 PM Vincent Beck  wrote:
>
> +1 non binding. All AWS system tests are running successfully against 
> apache-airflow-providers-amazon==8.19.0rc1. You can see the results here: 
> https://aws-mwaa.github.io/#/open-source/system-tests/version/2852976ea6321b152ebc631d30d5526703bc6590_8.19.0rc1.html
>
> On 2024/03/04 21:34:04 Elad Kalif wrote:
> > Hey all,
> >
> > I have just cut the new wave Airflow Providers packages. This email is
> > calling a vote on the release,
> > which will last for 72 hours - which means that it will end on March 07,
> > 2024 at 21:30 UTC and until 3 binding +1 votes have been received.
> >
> > Consider this my (binding) +1.
> >
> > Airflow Providers are available at:
> > https://dist.apache.org/repos/dist/dev/airflow/providers/
> >
> > *apache-airflow-providers--*.tar.gz* are the binary
> >  Python "sdist" release - they are also official "sources" for the provider
> > packages.
> >
> > *apache_airflow_providers_-*.whl are the binary
> >  Python "wheel" release.
> >
> > The test procedure for PMC members is described in
> > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-pmc-members
> >
> > The test procedure for Contributors who would like to test this RC is
> > described in:
> > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md#verify-the-release-candidate-by-contributors
> >
> >
> > Public keys are available at:
> > https://dist.apache.org/repos/dist/release/airflow/KEYS
> >
> > Please vote accordingly:
> >
> > [ ] +1 approve
> > [ ] +0 no opinion
> > [ ] -1 disapprove with the reason
> >
> > Only votes from PMC members are binding, but members of the community are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > Please note that the version number excludes the 'rcX' string.
> > This will allow us to rename the artifact without modifying
> > the artifact checksums when we actually release.
> >
> > The status of testing the providers by the community is kept here:
> > https://github.com/apache/airflow/issues/37890
> >
> > The issue is also the easiest way to see important PRs included in the RC
> > candidates.
> > Detailed changelog for the providers will be published in the documentation
> > after the
> > RC candidates are released.
> >
> > You can find the RC packages in PyPI following these links:
> >
> > https://pypi.org/project/apache-airflow-providers-amazon/8.19.0rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-beam/5.6.2rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-druid/3.9.0rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-hdfs/4.3.3rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-hive/7.0.1rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-livy/3.7.3rc1/
> > https://pypi.org/project/apache-airflow-providers-apache-pinot/4.3.1rc1/
> > https://pypi.org/project/apache-airflow-providers-celery/3.6.1rc1/
> > https://pypi.org/project/apache-airflow-providers-cncf-kubernetes/8.0.1rc1/
> > https://pypi.org/project/apache-airflow-providers-common-sql/1.11.1rc1/
> > https://pypi.org/project/apache-airflow-providers-dbt-cloud/3.7.0rc1/
> > https://pypi.org/project/apache-airflow-providers-docker/3.9.2rc1/
> > https://pypi.org/project/apache-airflow-providers-exasol/4.4.3rc1/
> > https://pypi.org/project/apache-airflow-providers-google/10.16.0rc1/
> > https://pypi.org/project/apache-airflow-providers-hashicorp/3.6.4rc1/
> > https://pypi.org/project/apache-airflow-providers-http/4.10.0rc1/
> > https://pypi.org/project/apache-airflow-providers-microsoft-azure/9.0.1rc1/
> > https://pypi.org/project/apache-airflow-providers-microsoft-psrp/2.6.0rc2/
> > https://pypi.org/project/apache-airflow-providers-mysql/5.5.4rc1/
> > https://pypi.org/project/apache-airflow-providers-openlineage/1.6.0rc1/
> > https://pypi.org/project/apache-airflow-providers-opensearch/1.1.2rc1/
> > https://pypi.org/project/apache-airflow-providers-postgres/5.10.2rc1/
> > https://pypi.org/project/apache-airflow-providers-presto/5.4.2rc1/
> > https://pypi.org/project/apache-airflow-providers-salesforce/5.6.3rc1/
> > https://pypi.org/project/apache-airflow-providers-smtp/1.6.1rc1/
> > https://pypi.org/project/apache-airflow-providers-telegram/4.4.0rc2/
> > https://pypi.org/project/apache-airflow-providers-trino/5.6.3rc1/
> > https://pypi.org/project/apache-airflow-providers-weaviate/1.3.3rc1/
> >
> > Cheers,
> > Elad Kalif
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org

Re: [DISCUSS] Deprecation policy for apache-airflow-providers-google package

2024-03-05 Thread Jarek Potiuk
Hey Elad. I think (and let me repeat that) - nobody suggests that we
ALWAYS remove some deprecations after 6 months. The idea  - as I see
it - is completely reversed. As I understand current idea is that we
say "By default we do not introduce any breaking changes (i.e. we do
not remove deprecations) for 6 months from the moment we introduced
last breaking changes. This means that we will have (again by
default) at most two breaking changes in each provider per year. Of
course this is just "default" and we can do deprecations more - or
less - often if we see we want to make exception, but we need to have
a very good reason for it. But I do not think anyone is putting an
expiry date and will "force" removal of deprecations at specific date.
Introducing breaking change in provider means that there are SOME
deprecations removed - maybe all, maybe just a few. I think the clue
of this change is to introduce more-or-less stable cadence "generally
speaking we make a breaking change in provider not more often than
every 6 months".

It's very similar to our rule that we release providers more-or-less
every 2 weeks. Sometimes we do it less often, sometimes we don't
release provider even if we could (for example we did that with FAB)
and sometimes we do ad-hoc release because we see a need for it.

Is this correct rephrasing of the proposal - Eugen and Michał? Do I
understand it correctly?

J.

On Tue, Mar 5, 2024 at 2:23 PM Elad Kalif  wrote:
>
> Eugen, about your two items:
> - What should be used instead
> - The date after which the method/parameter/operator will be removed
>
>
> The first is already included today in all deprecation warnings.
> The second is a big -1 from me.  We normally say that it will be removed in
> the next major release. Sometimes we enforce it and sometimes we don't.
> That is our choice.
> I said it before and I will say it again. Time based deprecation is not the
> way to do it. Sometimes the author of a PR and the reviewers are not fully
> aware of the blast radius of a specific deprecation.
> In Airflow any committer can merge any PR - I do not think it's right to
> put on the shoulders of the committer the decision of when something is to
> be removed before we even accept the deprecation.
>
> Michał, The policy we have now is a shared governance model which means
> that Google can decide and act on deprecating code and also on when to
> remove code.
> We as maintainers, and specifically I as release manager for providers, will
> try to accommodate Google's decision on this as Google knows best what is
> right for their customers.
> In the meantime, Google provider has many deprecation warnings some of them
> are very old - I didn't see Google start a cleanup deprecation process.
> If you think this is the right time then feel free to raise PRs for this.
> This is why I say that policy may be redundant - you can do it at will.
>
> On Tue, Mar 5, 2024 at 12:49 PM Eugen Kosteev  wrote:
>
> > Hi.
> >
> > Thanks for feedback and discussion.
> > Let me summarise what actually I want to propose (because there were a lot
> > in my initial email).
> >
> > My proposal is, in case of deprecation, to emit messages in the specific
> > format, example of the message:
> > “””
> > The “given” parameter is deprecated and will be removed after dd.mm.yyyy.
> > Please use “this” parameter instead.
> > “””
> > The format of the warning message may vary, but it has to contain:
> > - What should be used instead
> > - The date after which the method/parameter/operator will be removed
> >
> > Everything else, such as decommission/removal only once bumping major
> > version, cadence of releases, etc. is not something that I propose to
> > change.
> >
> > The idea in general is to make Airflow users a convenient way to adjust
> > their DAGs to deprecations, thus:
> > - what to use instead
> > - when they need to update
> > should help them.
> >
> > If we have a date in the deprecation message, it means that the removal
> > (actual decommission) will happen in one of the major releases after this
> > date.
> >
> > What do you think about this?
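A sketch of how the message format proposed above could be emitted as a standard Python `DeprecationWarning` (the parameter names and the date are illustrative only, not an actual policy decision):

```python
import warnings


def warn_deprecated_param(old: str, new: str, removal_date: str) -> None:
    # Message format follows the proposal above: name the replacement and
    # the date after which a major release may remove the parameter.
    warnings.warn(
        f'The "{old}" parameter is deprecated and will be removed after '
        f'{removal_date}. Please use "{new}" parameter instead.',
        DeprecationWarning,
        stacklevel=2,
    )


# Capture the warning just to show the rendered message.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated_param("delegate_to", "impersonation_chain", "01.09.2024")

print(str(caught[0].message))
```

The removal itself would still only happen in a major provider release after the stated date, as the proposal says.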
> >
> >
> > On Mon, Mar 4, 2024 at 7:03 PM Michał Modras
> >  wrote:
> >
> > > First, I'd like to say that I support Eugen's proposal and I agree with
> > the
> > > enhancements suggested by Jarek.
> > >
> > > I'm a bit confused about Elad's point 3 - are you suggesting having a
> > > global policy for all providers, or that we should not codify our
> > approach
> > > at all since different providers have different requirements, instead of
>

Re: Bad mixing of decorated and classic operators (users shooting themselves in their foot)

2024-03-02 Thread Jarek Potiuk
rld",
> task_id="hello_operator",
> run_id="run_id",
> map_index=-1,
> select_columns=True,
> lock_for_update=False,
> session=session,
> ).thenReturn(None)
>
> operator = HelloWorldOperator(task_id="hello_operator", dag=dag)
>
> assert_that(operator.called).is_false()
>
> task_instance = TaskInstance(task=operator, run_id="run_id")
> task_instance.task_id = "hello_operator"
> task_instance.dag_id = "hello_world"
> task_instance.dag_run = DagRun(run_id="run_id", dag_id="hello_world", 
> execution_date=pendulum.now(), state=DagRunState.RUNNING)
> task_instance._run_raw_task(test_mode=True, session=session)
>
> assert_that(operator.called).is_true()
>
> Maybe start a pull request for this one?  What do you guys think?
>
> Kind regards,
> David
>
> -Original Message-
> From: Andrey Anshin 
> Sent: Tuesday, 27 February 2024 14:36
> To: dev@airflow.apache.org
> Subject: Re: Bad mixing of decorated and classic operators (users shooting 
> themselves in their foot)
>
> [You don't often get email from andrey.ans...@taragol.is. Learn why this is 
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> EXTERNAL MAIL: Indien je de afzender van deze e-mail niet kent en deze niet 
> vertrouwt, klik niet op een link of open geen bijlages. Bij twijfel, stuur 
> deze e-mail als bijlage naar ab...@infrabel.be<mailto:ab...@infrabel.be>.
>
> I can't see which problem is solved by allowing running one operator inside 
> another.
>
> *Propagate templated fields?*
> In most cases it could be changed for a particular task or entire operator by
> monkey patching, which in this case is safer than running an operator inside
> an operator. Or it is even available out of the box.
>
> *Do an expensive operation for preparing parameters to the classical
> operators?*
> Well, just calculate it in a separate taskflow operator and propagate the
> output into the classical operator; most of them do not expect a huge
> input and the types are primitives.
>
> I'm not sure it is possible to solve everything on top of the
> Task/Operators, e.g.:
> - Run it into the @task.docker, @task.kubernetes, @task.external, 
> @task.virtualenv
> - Run deferrable operators
> - Run rescheduling sensors
>
> Technically we could just run it in a separate executor inside the
> worker - I guess the same way it worked in the past, when the K8S and
> Celery executors had a Run Task option.
>
> Another popular case is running an operator inside callbacks, which is
> pretty popular for Slack, because guidance from the Airflow 1.10.x era
> still exists on the Internet even though we have Hooks and Notifiers for
> that. Some links from the first page of Google results for an "airflow
> slack notifications" query:
>
> *Bad Examples:*
> -
> https://towardsdatascience.com/automated-alerts-for-airflow-with-slack-5c6ec766a823
> -
> https://www.reply.com/data-reply/en/content/integrating-slack-alerts-in-airflow
> -
> https://awstip.com/integrating-slack-with-airflow-step-by-step-guide-to-setup-alerts-1dc71d5e65ef
> - https://naiveskill.com/airflow-slack-alert/
> - https://gist.github.com/kzzzr/a2a4152f6a7c03cd984e797c08ac702f
> -
> https://docs.astronomer.io/learn/error-notifications-in-airflow#legacy-slack-notifications-pre-26
>
> *Good Examples:*
> -
> https://www.restack.io/docs/airflow-knowledge-apache-providers-slack-webhook-http-pypi-operator
> -
> https://docs.astronomer.io/learn/error-notifications-in-airflow?tab=taskflow#example-pre-built-notifier-slack
>
> I do not know how to force users not to use this approach but I guess it is 
> better to avoid "If you can't win it, lead it".
>
>
> On Tue, 27 Feb 2024 at 13:48, Jarek Potiuk  wrote:
>
> > Yeah. I kinda like (and see it emerging from the discussion) that we
> > can (which I love) have cake and eat it too :). Means - I think we can
> > have both 1) and 2) ...
> >
> > 1) We should raise an error if someone uses AnOperator in task context
> > (the way TP described it would work nicely) - making calling the
> > `execute` pattern directly wrong
> > 2) MAYBE we can figure out a way to actually allow the users to use
> > the logic that Bolke described in a nice, and "supported" way. I would
> > actually love it, if we find an easy way to make the 3500+ operators
> > we have - immediately available to our taskflow users.
> >
> > I don't particularly like the idea of havin

Re: [VOTE] Release Apache Airflow Helm Chart 1.13.0 based on 1.13.0rc1

2024-03-02 Thread Jarek Potiuk
+1 (binding): verified reproducibility (yay - works nicely finally !),
signatures, checksums, licences. Installed Airflow 2.8.2 in kind
cluster using the "rc1" - published chart., played a bit with it - ran
a few DAGs using Breeze - using the same version as released in the
Helm Chart - all looks good.

On Sat, Mar 2, 2024 at 5:30 AM Jed Cunningham  wrote:
>
> Hello Apache Airflow Community,
>
> This is a call for the vote to release Helm Chart version 1.13.0.
>
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/airflow/helm-chart/1.13.0rc1/
>
> airflow-chart-1.13.0-source.tar.gz - is the "main source release" that
> comes with INSTALL instructions.
> airflow-1.13.0.tgz - is the binary Helm Chart release.
>
> Public keys are available at: https://www.apache.org/dist/airflow/KEYS
>
> For convenience "index.yaml" has been uploaded (though excluded from
> voting), so you can also run the below commands.
>
> helm repo add apache-airflow-dev
> https://dist.apache.org/repos/dist/dev/airflow/helm-chart/1.13.0rc1/
> helm repo update
> helm install airflow apache-airflow-dev/airflow
>
> airflow-1.13.0.tgz.prov - is also uploaded for verifying Chart Integrity,
> though not strictly required for releasing the artifact based on ASF
> Guidelines.
>
> $ helm gpg verify airflow-1.13.0.tgz
> gpg: Signature made Fri Mar  1 21:16:51 2024 MST
> gpg: using RSA key E1A1E984F55B8F280BD9CBA20BB7163892A2E48E
> gpg: issuer "jedcunning...@apache.org"
> gpg: Good signature from "Jed Cunningham "
> [ultimate]
> plugin: Chart SHA verified.
> sha256:23155cf90b66c8ec6d49d2060686f90d23329eecf71c5368b1f0b06681b816cc
>
> The vote will be open for at least 72 hours (2024-03-05 04:35 UTC) or until
> the necessary number of votes is reached.
>
> https://www.timeanddate.com/countdown/to?iso=20240305T0435=136=cursive
>
> Please vote accordingly:
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> Only votes from PMC members are binding, but members of the community are
> encouraged to test the release and vote with "(non-binding)".
>
> Consider this my (binding) +1.
>
> For license checks, the .rat-excludes files is included, so you can run the
> following to verify licenses (just update your path to rat):
>
> tar -xvf airflow-chart-1.13.0-source.tar.gz
> cd airflow-chart-1.13.0
> java -jar apache-rat-0.13.jar chart -E .rat-excludes
>
> Please note that the version number excludes the `rcX` string, so it's now
> simply 1.13.0. This will allow us to rename the artifact without modifying
> the artifact checksums when we actually release it.
>
> The status of testing the Helm Chart by the community is kept here:
> https://github.com/apache/airflow/issues/37844
>
> Thanks,
> Jed

-
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org



Re: [DISCUSS] Considering trying out uv for our CI workflows

2024-02-29 Thread Jarek Potiuk
Follow up from today:

1) I optimised image building for CI a little bit still (but that's a
single-digit percent improvement)
2) I also found a way to make uv work for PROD images. It was not working
so far because of our `--user` way of installing packages. For a long time
I wanted to get rid of that approach as it caused many problems, and I
believe I finally found a good way to do it - with 100% backwards
compatibility, including some of the PythonVirtualenv cases. Previously I
could not use a virtualenv to install Airflow in our PROD image, but it
seems that a small trick with the right location of the venv in our image
does the job with full compatibility.

This one is a bit tricky - because we do not want (for a long time) to
switch `pip` to `uv` for our users. So while in CI most of the PROD images
(to save time) will be built with `--use-uv`, there is a separate build and
set of tests that will run with `--no-use-uv`. Regular users will have to
use `--build-arg AIRFLOW_USE_UV` to switch to using uv to build the image.
Bonus point: even in `pip`-built images users will be able to use `uv` for
their installations (this is something our users are already asking for:
https://github.com/apache/airflow/issues/37785 - it seems uv is -
similarly to ruff - spreading like wildfire).

The PR here: https://github.com/apache/airflow/pull/37796. Overall it's
~55% faster to build a PROD image from scratch with uv than with pip on my
machine (2m vs 4m45s)  - pretty consistent percentage gain as in the CI
image. End result are pretty much identical images (for size and looks like
content - and they pass our PROD image tests - and airflow works as usual
in them)

J.

On Tue, Feb 27, 2024 at 7:58 PM Jarek Potiuk  wrote:

> One more update - I am still looking at it and fine-tuning stuff and will
> have a few  more things coming
>
> I found out that we were still using `pip` for `pip constraints
> generation` (those are the constraints that our users use).
> I switched that one to `uv` and it's now 30 seconds instead of more than 5
> minutes - which is a more than 10x improvement.
>
> Plus - we get all-canonical `pypi` names back, because I also switched to
> `uv pip freeze`, and uv nicely canonicalizes all the constraints
> generated. I am also switching now with
> https://github.com/apache/airflow/pull/37754 to a new 0.1.11 version that
> has some bug fixes and new features; this PR also adds an upgrade check
> that will tell us when new versions of `pip` and `uv` are available (by
> failing the canary build job).
>
> J.
>
> On Tue, Feb 27, 2024 at 7:49 PM Oliveira, Niko 
> wrote:
>
>> Fantastic results!
>>
>> > It also means that if you've been using breeze and were sometimes
>> afraid to
>>
>> > hit "y" to rebuild the image, being afraid that it will take 20 minutes
>> or
>> > so - not any more. It should be WAY faster now.
>>
>> I'm very excited about this speed up as well as our CI :)
>>
>> 
>> From: Jarek Potiuk 
>> Sent: Tuesday, February 27, 2024 2:44:14 AM
>> To: dev@airflow.apache.org
>> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Considering trying
>> out uv for our CI workflows
>>
>> CAUTION: This email originated from outside of the organization. Do not
>> click links or open attachments unless you can confirm the sender and know
>> the content is safe.
>>
>>
>>
>> AVERTISSEMENT: Ce courrier électronique provient d’un expéditeur externe.
>> Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe si vous ne pouvez
>> pas confirmer l’identité de l’expéditeur et si vous n’êtes pas certain que
>> le contenu ne présente aucun risque.
>>
>>
>>
>> Summarising where we are:
>>
>> After ~24 hrs of operations, it looks really cool and fulfills (and
>> actually exceeds) all my expectations.
>>
>> * Multiple PRs succeeded, we got quite a few constraints updated
>> automatically after successful canary runs:
>> https://github.com/apache/airflow/commits/constraints-main/ (and they
>> look
>> perfectly fine - pretty much what I'd expect)
>> * I looked through a number of image builds in "canary" runs and the
>> regular 10-12 minutes build-image jobs are down to 3-4 minutes
>> * I just did an experiment: on my machine I ran a complete from-scratch
>> CI image build with new dependencies for breeze (with `breeze ci
>> image build --python 3.9 --docker-cache disabled
>> --upgrade-to-newer-dependencies` ) and compared it with v2-8-test branch
>> where we do not have the change applied yet
>>
>> Results (on my desktop machine (16 cores, network 1Gb download and very
>> fast disk):
>>
>> * v2-8-test: 73

Re: [DISCUSS] Deprecation policy for apache-airflow-providers-google package

2024-02-29 Thread Jarek Potiuk
One more thing. This one is essentially impossible (Hyrum's law explains
that very well: https://www.hyrumslaw.com/):

> The change is considered to be breaking (in the context of the google
provider package) if a DAG that was working before stops working after
the change.

I propose to change it to:

> The change is considered to be breaking if a DAG using the providers
(in the way they were intended to be used), that was working before,
stops working after the change.

It is absolutely not possible to make sure that every single DAG written by
someone will continue working after any change. Literally.

For example, if someone extends BigQueryOperator and calls
`self._this_very_much_internal_method()` in their operator's new execute(),
and we rename the method, their DAG will stop working. Is that breaking? No.
Is the user using it the way it was intended to be used? Absolutely not.
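To make the failure mode concrete, here is a minimal sketch (all class and
method names below are invented for illustration; this is not the actual
provider code): a user subclass that reaches into a private method keeps
working only until the provider renames that method.

```python
# Hypothetical stand-in for a provider operator. The underscore-prefixed
# method is internal and may be renamed in any release.
class ExampleBigQueryOperator:
    def _build_job_config(self):  # internal; not part of the public API
        return {"use_legacy_sql": False}

    def execute(self, context):
        return self._build_job_config()


# A user operator that depends on the private method. It works today,
# but raises AttributeError the moment the provider renames
# _build_job_config - exactly the Hyrum's-law breakage described above.
class MyCustomOperator(ExampleBigQueryOperator):
    def execute(self, context):
        config = self._build_job_config()  # relies on a private API
        config["labels"] = {"team": "data"}
        return config


op = MyCustomOperator()
result = op.execute({})
```

The rename is not considered breaking precisely because `_build_job_config`
was never part of the intended public surface.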

SemVer is not a promise that things won't break; it's a promise that our
intention is that things won't break. Whether the way our users use it
will make it break or not is another story.
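That "intention" usually takes a concrete shape: before a public entry point
is removed, it is kept for at least one released version as an alias that
emits a DeprecationWarning. A minimal sketch of the deprecate-then-remove
pattern (class and method names are invented, not the actual provider API):

```python
import warnings


class ExampleHook:
    def get_client(self):
        """New, supported entry point."""
        return {"client": "ok"}

    def get_conn(self):
        """Old entry point, kept as a deprecated alias until the next
        breaking release, so existing DAGs keep working but are warned."""
        warnings.warn(
            "get_conn is deprecated and will be removed in the next "
            "breaking release; use get_client instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_client()


hook = ExampleHook()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    client = hook.get_conn()  # still works, but emits a DeprecationWarning
```

Once the alias has shipped in a released minor/patchlevel version, removing
it in the next breaking release is within the policy discussed below.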

J






On Thu, Feb 29, 2024 at 12:38 PM Jarek Potiuk  wrote:

> BTW. Another example of an exception would be a security fix. We **might**
> find a security issue that will require a breaking change. But then again -
> the breaking change might **only** cover that security fix - we do not have
> to remove all the deprecations in this case (but we could if we decide so -
> very much depending on the nature of those deprecations)
>
> On Thu, Feb 29, 2024 at 12:36 PM Jarek Potiuk  wrote:
>
>> I think it's a good idea to codify the rules; indeed, as Elad mentioned,
>> we **mostly** follow a very similar approach, but it's not written
>> anywhere.
>>
>>
>> Re: points of Elad:
>>
>> 1. I am ok with 6 months, but I agree we should not have "strict" limits
>> here. It's fine to have some default (i.e. accumulate deprecations
>> for some time - max. 6 months) - but I think the policy should have an
>> option to apply exceptions - for example, when we have a major change in
>> many libraries and we need to make some breaking changes (the ads case) we
>> should do it (but in this case we don't have to immediately remove all
>> deprecations - they can wait for the regular, say 6-monthly, breaking
>> change that will be coming). The users can - if needed - even on new
>> versions of Airflow - downgrade to previous provider versions in case they
>> have a problem with it.
>>
>> I like the idea of having some time (6 months) after which we can very
>> safely say "OK, enough is enough, let's remove deprecations" - and this
>> proposal addresses it nicely. Currently it's a purely subjective judgment.
>> One small thing to clarify here about "accumulating" deprecations - I think
>> we should make sure the deprecation is present in an already released
>> minor/patchlevel version before we remove it (so even if something was
>> deprecated 2 months ago, as long as we had a release with the deprecation,
>> it's ok to remove it in the next breaking change).
>>
>> 2. Agree (and again, "strict" is not good here, I think). I'd treat that 6
>> months as a baseline. Before that, we should have a very good reason to
>> make a breaking change (and we might decide to do it only partially);
>> after, we are free to remove things that have already been announced as
>> deprecated, and we should consider making a breaking change whenever we
>> release a new version for whatever reason (but again - not mandatory). It's
>> just an indication that tells us "The last breaking release was 6 months
>> ago, so it's now OK to remove all the deprecations without needing an
>> extremely good reason".
>>
>> Very much agree with 3. -> this should be the same as for other
>> providers. We should adopt the same rule for all of them. And if we apply
>> the adjustments above, we should - I think - be ok applying it to any
>> provider.
>>
>> J
>>
>>
>>
>> On Thu, Feb 29, 2024 at 12:20 PM Elad Kalif  wrote:
>>
>>> This is not too much different from what we already do.
>>> We deprecate, then we remove. We almost never create a breaking change
>>> without deprecating first.
>>>
>>> I'd like to raise a few notes:
>>>
>>> 1. Major releases can happen frequently. I see no problem with it. This
>>> is more a question of what the changes are. If Google has 3 major
>>> releases, one with a breaking change to ads, another to leveldb, and a
>>> 3rd to cloud, these are not related to one another, and it is unlikely
>>> that there is 1 user that us

Re: [DISCUSS] Deprecation policy for apache-airflow-providers-google package

2024-02-29 Thread Jarek Potiuk
BTW. Another example of an exception would be a security fix. We **might**
find a security issue that will require a breaking change. But then again -
the breaking change might **only** cover that security fix - we do not have
to remove all the deprecations in this case (but we could if we decide so -
very much depending on the nature of those deprecations)

On Thu, Feb 29, 2024 at 12:36 PM Jarek Potiuk  wrote:

> I think it's a good idea to codify the rules; indeed, as Elad mentioned,
> we **mostly** follow a very similar approach, but it's not written
> anywhere.
>
>
> Re: points of Elad:
>
> 1. I am ok with 6 months, but I agree we should not have "strict" limits
> here. It's fine to have some default (i.e. accumulate deprecations
> for some time - max. 6 months) - but I think the policy should have an
> option to apply exceptions - for example, when we have a major change in
> many libraries and we need to make some breaking changes (the ads case) we
> should do it (but in this case we don't have to immediately remove all
> deprecations - they can wait for the regular, say 6-monthly, breaking
> change that will be coming). The users can - if needed - even on new
> versions of Airflow - downgrade to previous provider versions in case they
> have a problem with it.
>
> I like the idea of having some time (6 months) after which we can very
> safely say "OK, enough is enough, let's remove deprecations" - and this
> proposal addresses it nicely. Currently it's a purely subjective judgment.
> One small thing to clarify here about "accumulating" deprecations - I think
> we should make sure the deprecation is present in an already released
> minor/patchlevel version before we remove it (so even if something was
> deprecated 2 months ago, as long as we had a release with the deprecation,
> it's ok to remove it in the next breaking change).
>
> 2. Agree (and again, "strict" is not good here, I think). I'd treat that 6
> months as a baseline. Before that, we should have a very good reason to
> make a breaking change (and we might decide to do it only partially);
> after, we are free to remove things that have already been announced as
> deprecated, and we should consider making a breaking change whenever we
> release a new version for whatever reason (but again - not mandatory). It's
> just an indication that tells us "The last breaking release was 6 months
> ago, so it's now OK to remove all the deprecations without needing an
> extremely good reason".
>
> Very much agree with 3. -> this should be the same as for other providers.
> We should adopt the same rule for all of them. And if we apply
> the adjustments above, we should - I think - be ok applying it to any
> provider.
>
> J
>
>
>
> On Thu, Feb 29, 2024 at 12:20 PM Elad Kalif  wrote:
>
>> This is not too much different from what we already do.
>> We deprecate, then we remove. We almost never create a breaking change
>> without deprecating first.
>>
>> I'd like to raise a few notes:
>>
>> 1. Major releases can happen frequently. I see no problem with it. This is
>> more a question of what the changes are. If Google has 3 major releases,
>> one with a breaking change to ads, another to leveldb, and a 3rd to cloud,
>> these are not related to one another, and it is unlikely that there is 1
>> user that uses all of these together. We create major releases when
>> needed, and we keep a close eye on, and challenge, why a change must be a
>> breaking change. If possible, we will always prefer to be backward
>> compatible.
>> 2. Time-based deprecations are not a good approach. Some changes require
>> much more time to be acted upon, while others may be trivial.
>> As a rule, we almost never cut a breaking-change release just to remove
>> deprecations.
>> 3. Setting a policy per provider is not the right way to go. We have a
>> shared governance model for provider handling, and as release manager I
>> will do what I can to accommodate requests from companies who participate
>> in managing the provider. This means that if Google wants to remove
>> deprecations and asks for a breaking-change release, most likely we will
>> accommodate their request. My point of view is that the company knows what
>> is best for their own users. I don't think that what is right for Google
>> must bind AWS or Microsoft. One policy goes against the shared
>> governance idea.
>>
>>
>>
>> On Thu, Feb 29, 2024 at 11:04 AM Eugen Kosteev  wrote:
>>
>> > Hi.
>> >
>> > I would like to discuss/propose a deprecation policy for the
>> > apache-airflow-providers-google package, mostly for deprecating Airflow
>> > 
