It’s not just release time, but any time we build a package, which happens on “every” CI run. The normal unit tests will use code from airflow-common/src/airflow_common; the kube tests, which build an image, will build the dists and vendor in the code from that commit.
There is only a single copy of the shared code committed to the repo, so
there is never anything to synchronise.
On 4 Jul 2025, at 15:53, Amogh Desai <amoghdesai....@gmail.com> wrote:
Thanks Ash.
This is really cool and helpful that you were able to test both scenarios -- repo checkout and installing from the vendored package -- and that the resolution worked fine in both.
I like this idea compared to the relative-import one for a few reasons:
- It feels like it will take some time to adjust to the new coding standard we would have to lay down if we imposed relative imports in the shared dist
- We can continue using our repo-wide absolute-import standards; it is also much easier for situations where we do a global find + replace in the IDE
- Vendoring is a proven and established paradigm across projects and would give us the build tooling we need out of the box
Nothing much against relative imports, but with the evidence provided above, the vendored approach seems to only do us good.
Regarding synchronizing it, release time should be fine as long as we have a good CI workflow to catch such issues per PR when changes are made in the shared dist (pre-commit would make it really slow, I guess).
If we can run our tests with vendored code we should be mostly covered.
Good effort all!
Thanks & Regards,
Amogh Desai
On Fri, Jul 4, 2025 at 7:23 PM Ash Berlin-Taylor <a...@apache.org>
wrote:
Okay, I think I’ve got something that works and I’m happy with.
https://github.com/astronomer/airflow/tree/shared-vendored-lib-tasksdk-and-core
This produces the following from `uv build task-sdk`
- https://github.com/user-attachments/files/21058976/apache_airflow_task_sdk-1.1.0.tar.gz
- https://github.com/user-attachments/files/21058996/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip
(`.whl.zip` as GH won't allow .whl upload, but will .zip)
```
❯ unzip -l dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip | grep _vendor
       50  02-02-2020 00:00   airflow/sdk/_vendor/.gitignore
     2082  02-02-2020 00:00   airflow/sdk/_vendor/__init__.py
       28  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common.pyi
       18  02-02-2020 00:00   airflow/sdk/_vendor/vendor.txt
      785  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/__init__.py
    10628  02-02-2020 00:00   airflow/sdk/_vendor/airflow_common/timezone.py
```
And similarly in the .tar.gz, so our “sdist” is complete too:
```
❯ tar -tzf dist/apache_airflow_task_sdk-1.1.0.tar.gz | grep _vendor
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/.gitignore
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common.pyi
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/vendor.txt
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/__init__.py
apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/timezone.py
```
The plugin works at build time by including/copying the libs specified in vendor.txt into place (and letting `vendoring` take care of import rewrites).
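The rough shape of the build-time copying is something like the following. This is an illustrative sketch only, written as a hatchling-style build hook; the hook name, paths, and details are made up, and the import-rewriting half (done by `vendoring` in the real plugin) is not shown:

```
# Sketch of a build hook that vendors the shared lib at build time.
# Hook name and paths are hypothetical; see the branch for the real plugin.
import shutil
from pathlib import Path

from hatchling.builders.hooks.plugin.interface import BuildHookInterface


class VendorSharedLibsHook(BuildHookInterface):
    PLUGIN_NAME = "vendor-shared-libs"

    def initialize(self, version, build_data):
        vendor_dir = Path(self.root) / "src" / "airflow" / "sdk" / "_vendor"
        # vendor.txt lists one shared lib per line, e.g. "airflow_common"
        for line in (vendor_dir / "vendor.txt").read_text().splitlines():
            lib = line.strip()
            if not lib or lib.startswith("#"):
                continue
            src = Path(self.root).parent / "airflow-common" / "src" / lib
            dest = vendor_dir / lib
            if dest.exists():
                shutil.rmtree(dest)
            shutil.copytree(src, dest)
            # Make sure the copied tree ends up in the wheel/sdist
            build_data["force_include"][str(dest)] = f"airflow/sdk/_vendor/{lib}"
```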
For the imports to continue to work at “dev” time/from a repo checkout, I have added an import finder to `sys.meta_path`; since it’s at the end of the list, it will only be used if the normal import can’t find things.
https://github.com/astronomer/airflow/blob/996817782be6071b306a87af9f36fe1cf2d3aaa3/task-sdk/src/airflow/sdk/_vendor/__init__.py
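In rough outline, the finder does something like this. This is an illustrative sketch only; the real implementation is the `__init__.py` linked above, and the class names and details here are made up:

```
# Fallback finder: maps airflow.sdk._vendor.X imports to the top-level X
# package when no vendored copy exists on disk. Illustrative, not the
# actual implementation.
import importlib
import importlib.abc
import importlib.machinery
import importlib.util
import sys

_PREFIX = "airflow.sdk._vendor."


class _AliasLoader(importlib.abc.Loader):
    def __init__(self, real_name):
        self.real_name = real_name

    def create_module(self, spec):
        # Reuse the real module object itself, so both names share one instance
        return importlib.import_module(self.real_name)

    def exec_module(self, module):
        # Nothing to do: the real module was executed when first imported
        pass


class _VendorFallbackFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if not fullname.startswith(_PREFIX):
            return None
        real_name = fullname[len(_PREFIX):]
        if importlib.util.find_spec(real_name) is None:
            return None
        return importlib.machinery.ModuleSpec(fullname, _AliasLoader(real_name))


# Appended, not prepended: in a built dist the genuinely vendored files win,
# so this finder only kicks in from a repo checkout.
sys.meta_path.append(_VendorFallbackFinder())
```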
This doesn’t quite give us the same runtime effect as “import rewriting”, as in this approach `airflow_common` is loaded directly (i.e. both airflow.sdk._vendor.airflow_common and airflow_common exist in sys.modules), but it does work for everything that I was able to test.
I tested it with the diff at the end of this message. My test ipython
shell:
```
In [1]: from airflow.sdk._vendor.airflow_common.timezone import foo
In [2]: foo
Out[2]: 1
In [3]: import airflow.sdk._vendor.airflow_common
In [4]: import airflow.sdk._vendor.airflow_common.timezone
In [5]: airflow.sdk._vendor.airflow_common.__file__
Out[5]: '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/__init__.py'
In [6]: airflow.sdk._vendor.airflow_common.timezone.__file__
Out[6]: '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/timezone.py'
```
And in a standalone environment with the SDK dist I built (it needed the matching airflow-core right now, but that has nothing to do with this discussion):
```
❯ _AIRFLOW__AS_LIBRARY=1 uvx --python 3.12 \
    --with dist/apache_airflow_core-3.1.0-py3-none-any.whl \
    --with dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl ipython
Python 3.12.7 (main, Oct 16 2024, 07:12:08) [Clang 18.1.8 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help.
Tip: You can use `%hist` to view history, see the options with `%history?`
In [1]: import airflow.sdk._vendor.airflow_common.timezone
In [2]: airflow.sdk._vendor.airflow_common.timezone.__file__
Out[2]: '/Users/ash/.cache/uv/archive-v0/WWq6r65aPto2eJOyPObEH/lib/python3.12/site-packages/airflow/sdk/_vendor/airflow_common/timezone.py'
```
```diff
diff --git a/airflow-common/src/airflow_common/__init__.py b/airflow-common/src/airflow_common/__init__.py
index 13a83393a9..927b7c6b61 100644
--- a/airflow-common/src/airflow_common/__init__.py
+++ b/airflow-common/src/airflow_common/__init__.py
@@ -14,3 +14,5 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+
+foo = 1
diff --git a/airflow-common/src/airflow_common/timezone.py b/airflow-common/src/airflow_common/timezone.py
index 340b924c66..58384ef20f 100644
--- a/airflow-common/src/airflow_common/timezone.py
+++ b/airflow-common/src/airflow_common/timezone.py
@@ -36,6 +36,9 @@ _PENDULUM3 = version.parse(metadata.version("pendulum")).major == 3
# - FixedTimezone(0, "UTC") in pendulum 2
utc = pendulum.UTC
+
+from airflow_common import foo
+
TIMEZONE: Timezone
```
On 3 Jul 2025, at 12:43, Jarek Potiuk <ja...@potiuk.com> wrote:
I think both approaches are doable:
1) -> We can very easily prevent bad imports across distributions by pre-commit, and make sure we are only doing relative imports in the shared modules. We are doing plenty of this already. And yes, it would require the relative imports we currently do not allow.
2) -> has one disadvantage: someone at some point in time will have to decide to synchronize this, and if it happens just before a release (I bet this is going to happen) it will lead to solving problems that would normally be solved during the PR when you make a change (i.e. the symbolic link has the advantage that whoever modifies shared code will be immediately notified in their PR that they broke something, because static checks or mypy or tests fail).
Ash, do you have an idea of a process for the synchronisation (who and when) in the case of vendoring? Maybe we could solve it if it is done more frequently and with some regularity? We could potentially force re-vendoring at PR time as well, any time shared code changes (and guard it by pre-commit). Other than that, I can't think of a place in our development workflow besides releases, and that seems a bit too late, as it puts the extra effort of fixing potential incompatibilities on the release manager and delays the release. WDYT?
Re: relative imports. I think for a shared library we could potentially relax this and allow them (and actually disallow absolute imports in the pieces of code that are shared - again, by pre-commit). As I recall, the only reason we forbade relative imports is how we are (or maybe were) doing Dag parsing and the failures resulting from it, so we decided to just not allow them for consistency. The way Dag parsing works is that when you use importlib to read the Dag from a file, relative imports do not work, as Python does not know what they should be relative to. But if a relative import is done from an imported package, it should be no problem, I think - otherwise our Dags would not be able to import any library that uses relative imports.
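A minimal illustration of that failure mode (the file path and module name are hypothetical; this loads a file the way a standalone parser does, with no package context):

```
# Loading a Dag file via importlib without package context. If the loaded
# file itself contains a relative import such as `from .helpers import x`,
# exec_module raises:
#   ImportError: attempted relative import with no known parent package
# Relative imports *inside packages the Dag file imports* are unaffected.
import importlib.util

spec = importlib.util.spec_from_file_location("my_dag", "dags/my_dag.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
```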
Of course consistency might be the reason why we do not want to introduce relative imports. I don't see it as an issue if it is guarded by pre-commit, though.
J.
On Thu, 3 Jul 2025 at 12:11, Ash Berlin-Taylor <a...@apache.org> wrote:
Oh yes, symlinks will work, with one big caveat: it does mean you can’t use absolute imports from one common module to another.
For example
https://github.com/apache/airflow/blob/4c66ebd06/airflow-core/src/airflow/utils/serve_logs.py#L41
where we have
```
from airflow.utils.module_loading import import_string
```
if we want to move serve_logs into this common lib that is then symlinked, then we wouldn’t be able to have `from airflow_common.module_loading import import_string`.
I can think of two possible solutions here.
1) is to allow/require relative imports in this shared lib, so `from .module_loading import import_string`
2) is to use `vendoring`[1] (from the pip maintainers) which will handle import-rewriting for us (see the sketch below).
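Concretely, the rewriting amounts to something like this (a sketch using the `module_loading` example from above and the `airflow.sdk._vendor` location; the shared `airflow_common` lib is assumed):

```
# In the shared lib's own source (and at dev time), absolute import:
from airflow_common.module_loading import import_string

# In the built task-sdk dist, after `vendoring` rewrites it:
from airflow.sdk._vendor.airflow_common.module_loading import import_string
```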
I’d entirely forgotten that symlinks in repos were a thing, so I prepared a minimal POC/demo of what the vendoring approach could look like here:
https://github.com/apache/airflow/commit/996817782be6071b306a87af9f36fe1cf2d3aaa3
Now personally I am more than happy with relative imports, but generally as a project we have avoided them, so I think that limits what we could do with a symlink-based approach.
-ash
[1] https://github.com/pradyunsg/vendoring
On 3 Jul 2025, at 10:30, Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
Thanks Ash
Yes, agree with others - option 2 would be preferred for me, making sure we have all the guardrails to protect against any unwanted behaviour in code sharing and that we run the right tests between the packages.
On Thu, Jul 3, 2025 at 10:02 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
Thanks for starting this discussion, Ash.
I would prefer option 2 here with proper tooling to handle the code
duplication at *release* time.
It is best to have a dist that has all it needs in itself.
Option 1 could very quickly get out of hand, and if we decide to separate triggerer / dag processor / config etc. into separate packages, back-compat is going to be a nightmare and will bite us harder than we anticipate.
Thanks & Regards,
Amogh Desai
On Thu, Jul 3, 2025 at 1:12 AM Kaxil Naik <kaxiln...@gmail.com>
wrote:
I prefer Option 2 as well, to avoid a matrix of dependencies.
On Thu, 3 Jul 2025 at 01:03, Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:
I'd also rather prefer option 2 - the reason is that it is rather pragmatic: we do not need to cut another package, and we keep the package count and the dependencies down.
I remember some time ago I was checking (together with Jarek, I am not sure anymore...) whether the usage of symlinks would be possible: keep the source in one package but "symlink" it into another. If the files are then materialized at packaging/release time, we have one set of code. Otherwise, if that is not possible, the redundancy could still be solved by a pre-commit hook - and in Git the files are de-duplicated anyway based on content hash, so this does not hurt.
On 02.07.25 18:49, Shahar Epstein wrote:
I support option 2 with proper automation & CI - the reasoning you've shown for it makes sense to me.
Shahar
On Wed, Jul 2, 2025 at 3:36 PM Ash Berlin-Taylor <a...@apache.org> wrote:
Hello everyone,
As we work on finishing off the code-level separation of Task SDK and Core (scheduler etc.), we have come across some situations where we would like to share code between them.
However it’s not as straightforward as “just put it in a common dist they both depend upon”, because one of the goals of the Task SDK separation was to have 100% complete version independence between the two, ideally even if they are built into the same image and venv. Most of the reason why this isn’t straightforward comes down to backwards compatibility - if we make a change to the common/shared distribution, every released version of the dists that depend on it still has to keep working.
We’ve listed the options we have thought about in https://github.com/apache/airflow/issues/51545 (but that covers some more things that I don’t want to get into in this discussion, such as possibly separating operators and executors out of a single provider dist).
To give a concrete example of some code I would like to share: https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py - the logging config. Another thing we will want to share is the AirflowConfigParser class from airflow.configuration (but notably: only the parser class, _not_ the default config values - again, let's not dwell on the specifics of that).
So to bring the options listed in the issue here for discussion, broadly speaking there are two high-level approaches:
1. A single shared distribution
2. No shared package and copy/duplicate code
The advantage of Approach 1 is that we only have the code in one place. However for me, at least in this specific case of the logging config or the AirflowConfigParser class, the problem is that backwards compatibility becomes much, much harder.
The main advantage of Approach 2 is that the code is released with/embedded in the dist (i.e. apache-airflow-task-sdk would contain the right version of the logging config and ConfigParser etc.). The downside is that either the code will need to be duplicated in the repo, or, better yet, it would live in a single place in the repo but some tooling (TBD) would automatically handle the duplication, either at commit time or, my preference, at release time.
For this kind of shared “utility” code I am very strongly leaning towards option 2 with automation, as otherwise I think the backwards-compatibility requirements would make it unworkable (very quickly over time the combinations we would have to test would just be unreasonable), and I don’t feel confident we can keep things as stable as we need to really deliver the version separation/independence I want to deliver with AIP-72.
So unless someone feels very strongly about this, I will come up with a draft PR for further discussion that implements code sharing via “vendoring” it at build time. I have an idea of how I can achieve this so that we have a single version in the repo and it’ll work there, but at build time we vendor it into the shipped dist so it lives at something like `airflow.sdk._vendor` etc.
In terms of repo layout, this likely means we would end up with:
airflow-core/pyproject.toml
airflow-core/src/
airflow-core/tests/
task-sdk/pyproject.toml
task-sdk/src/
task-sdk/tests/
airflow-common/src
airflow-common/tests/
# Possibly no airflow-common/pyproject.toml, as deps would be included in the downstream projects. TBD.
Thoughts and feedback welcomed.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org