Thanks Ash. This is really cool and helpful that you were able to test both scenarios -- repo checkout and also installing from the vendored package and the resolution worked fine too.
I like this idea compared the to relative import one for few reasons: - It feels like it will take some time to adjust to the new coding standard that we will lay if we impose relative imports in the shared dist - We can continue using repo wise absolute import standards, it is also much easier for situations when we do global search in IDE to find + replace, this could mean a change there - The vendoring work is a proven and established paradigm across projects and would out of box give us the build tooling we need also Nothing too against the relative import but with the evidence provided above, vendored approach seems to only do us good. Regarding synchronizing it, release time should be fine as long as we have a good CI workflow to probably catch such issues per PR if changes are made in shared dist? (precommit would make it really slow i guess) If we can run our tests with vendored code we should be mostly covered. Good effort all! Thanks & Regards, Amogh Desai On Fri, Jul 4, 2025 at 7:23 PM Ash Berlin-Taylor <a...@apache.org> wrote: > Okay, I think I’ve got something that works and I’m happy with. > > > https://github.com/astronomer/airflow/tree/shared-vendored-lib-tasksdk-and-core > > This produces the following from `uv build task-sdk` > - > https://github.com/user-attachments/files/21058976/apache_airflow_task_sdk-1.1.0.tar.gz > - > https://github.com/user-attachments/files/21058996/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip > > (`.whl.zip` as GH won't allow .whl upload, but will .zip) > > ``` > ❯ unzip -l dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl.zip | grep > _vendor > 50 02-02-2020 00:00 airflow/sdk/_vendor/.gitignore > 2082 02-02-2020 00:00 airflow/sdk/_vendor/__init__.py > 28 02-02-2020 00:00 airflow/sdk/_vendor/airflow_common.pyi > 18 02-02-2020 00:00 airflow/sdk/_vendor/vendor.txt > 785 02-02-2020 00:00 > airflow/sdk/_vendor/airflow_common/__init__.py > 10628 02-02-2020 00:00 > airflow/sdk/_vendor/airflow_common/timezone.py > ``` > > And similarly in the .tar.gz, so our “sdist” is complete too: > ``` > ❯ tar -tzf dist/apache_airflow_task_sdk-1.1.0.tar.gz |grep _vendor > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/.gitignore > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/__init__.py > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common.pyi > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/vendor.txt > > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/__init__.py > > apache_airflow_task_sdk-1.1.0/src/airflow/sdk/_vendor/airflow_common/timezone.py > ``` > > The plugin works at build time by including/copying the libs specified in > vendor.txt into place (and let `vendoring` take care of import rewrites.) > > For the imports to continue to work at “dev” time/from a repo checkout, I > have added a import finder to `sys.meta_path`, and since it’s at the end of > the list it will only be used if the normal import can’t find things. > > > https://github.com/astronomer/airflow/blob/996817782be6071b306a87af9f36fe1cf2d3aaa3/task-sdk/src/airflow/sdk/_vendor/__init__.py > > This doesn’t quite give us the same runtime effect “import rewriting” > affect, as in this approach `airflow_common` is directly loaded (i.e. > airflow.sdk._vendor.airflow_common and airflow_common exist in > sys.modules), but it does work for everything that I was able to test.. > > I tested it with the diff at the end of this message. My test ipython > shell: > > ``` > In [1]: from airflow.sdk._vendor.airflow_common.timezone import foo > > In [2]: foo > Out[2]: 1 > > In [3]: import airflow.sdk._vendor.airflow_common > > In [4]: import airflow.sdk._vendor.airflow_common.timezone > > In [5]: airflow.sdk._vendor.airflow_common.__file__ > Out[5]: > '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/__init__.py' > > In [6]: airflow.sdk._vendor.airflow_common.timezone.__file__ > Out[6]: > '/Users/ash/code/airflow/airflow/airflow-common/src/airflow_common/timezone.py' > > ``` > > > And in an standalone environment with the SDK dist I built (it needed the > matching airflow-core right now, but that is nothing to do with this > discussion): > > ``` > ❯ _AIRFLOW__AS_LIBRARY=1 uvx --python 3.12 --with > dist/apache_airflow_core-3.1.0-py3-none-any.whl --with > dist/apache_airflow_task_sdk-1.1.0-py3-none-any.whl ipython > Python 3.12.7 (main, Oct 16 2024, 07:12:08) [Clang 18.1.8 ] > Type 'copyright', 'credits' or 'license' for more information > IPython 9.4.0 -- An enhanced Interactive Python. Type '?' for help. > Tip: You can use `%hist` to view history, see the options with `%history?` > > In [1]: import airflow.sdk._vendor.airflow_common.timezone > > In [2]: airflow.sdk._vendor.airflow_common.timezone.__file__ > Out[2]: > '/Users/ash/.cache/uv/archive-v0/WWq6r65aPto2eJOyPObEH/lib/python3.12/site-packages/airflow/sdk/_vendor/airflow_common/timezone.py’ > `` > > > > ```diff > diff --git a/airflow-common/src/airflow_common/__init__.py > b/airflow-common/src/airflow_common/__init__.py > index 13a83393a9..927b7c6b61 100644 > --- a/airflow-common/src/airflow_common/__init__.py > +++ b/airflow-common/src/airflow_common/__init__.py > @@ -14,3 +14,5 @@ > # KIND, either express or implied. See the License for the > # specific language governing permissions and limitations > # under the License. > + > +foo = 1 > diff --git a/airflow-common/src/airflow_common/timezone.py > b/airflow-common/src/airflow_common/timezone.py > index 340b924c66..58384ef20f 100644 > --- a/airflow-common/src/airflow_common/timezone.py > +++ b/airflow-common/src/airflow_common/timezone.py > @@ -36,6 +36,9 @@ _PENDULUM3 = > version.parse(metadata.version("pendulum")).major == 3 > # - FixedTimezone(0, "UTC") in pendulum 2 > utc = pendulum.UTC > > + > +from airflow_common import foo > + > TIMEZONE: Timezone > > > ``` > > > On 3 Jul 2025, at 12:43, Jarek Potiuk <ja...@potiuk.com> wrote: > > > > I think both approaches are doable: > > > > 1) -> We can very easily prevent bad imports by pre-commit when importing > > from different distributions and make sure we are only doing relative > > imports in the shared modules. We are doing plenty of this already. And > yes > > it would require relative links we currently do not allow. > > > > 2) -> has one disadvantage that someone at some point in time will have > to > > decide to synchronize this and if it happens just before release (I bet > > this is going to happen) this will lead to solving problems that would > > normally be solved during PR when you make a change (i.e. symbolic link > has > > the advantage that whoever modifies shared code will be immediately > > notified in their PR - that they broke something because either static > > checks or mypy or tests fail. > > > > Ash, do you have an idea of a process (who and when) does the > > synchronisation in case of vendoring? Maybe we could solve it if it is > done > > more frequently and with some regularity? We could potentially force > > re-vendoring at PR time as well any time shared code changes (and prevent > > it by pre-commit. And I can't think of some place (other than releases) > in > > our development workflow and that seems to be a bit too late as puts an > > extra effort on fixing potential incompatibilities introduced on release > > manager and delays the release. WDYT? > > > > Re: relative links. I think for a shared library we could potentially > relax > > this and allow them (and actually disallow absolute links in the pieces > of > > code that are shared - again, by pre-commit). As I recall, the only > reason > > we forbade the relative links is because of how we are (or maybe were) > > doing DAG parsing and failures resulting from it. So we decided to just > not > > allow it to keep consistency. The way how Dag parsing works is that when > > you are using importlib to read the Dag from a file, the relative imports > > do not work as it does not know what they should be relative to. But if > > relative import is done from an imported package, it should be no > problem, > > I think - otherwise our Dags would not be able to import any library that > > uses relative imports. > > > > Of course consistency might be the reason why we do not want to introduce > > relative imports. I don't see it as an issue if it is guarded by > pre-commit > > though. > > > > J. > > > > > > J. > > > > > > czw., 3 lip 2025, 12:11 użytkownik Ash Berlin-Taylor <a...@apache.org> > > napisał: > > > >> Oh yes, symlinks will work, with one big caveat: It does mean you can’t > >> use absolute imports in one common module to another. > >> > >> For example > >> > https://github.com/apache/airflow/blob/4c66ebd06/airflow-core/src/airflow/utils/serve_logs.py#L41 > >> where we have > >> > >> ``` > >> from airflow.utils.module_loading import import_string > >> ``` > >> > >> if we want to move serve_logs into this common lib that is then > symlinked > >> then we wouldn’t be able to have `from airflow_common.module_loading > import > >> import_string`. > >> > >> I can think of two possible solutions here. > >> > >> 1) is to allow/require relative imports in this shared lib, so `from > >> .module_loading import import_string` > >> 2) is to use `vendoring`[1] (from the pip maintainers) which will handle > >> import-rewriting for us. > >> > >> I’d entirely forgot that symlinks in repos was a thing, so I prepared a > >> minimal POC/demo of what vendoring approach could look like here > >> > https://github.com/apache/airflow/commit/996817782be6071b306a87af9f36fe1cf2d3aaa3 > >> > >> Now personally I am more than happy with relative imports, but generally > >> as a project we have avoided them, so I think that limits what we could > do > >> with a symlink based approach. > >> > >> -ash > >> > >> [1] https://github.com/pradyunsg/vendoring > >> > >>> On 3 Jul 2025, at 10:30, Pavankumar Gopidesu <gopidesupa...@gmail.com> > >> wrote: > >>> > >>> Thanks Ash > >>> > >>> Yes agree option 2 would be preferred for me. Making sure we have all > the > >>> gaurdriles to protect any unwanted behaviour in code sharing and > >> executing > >>> right of tests between the packages. > >>> > >>> Agree with others, option 2 would be > >>> > >>> On Thu, Jul 3, 2025 at 10:02 AM Amogh Desai <amoghdesai....@gmail.com> > >>> wrote: > >>> > >>>> Thanks for starting this discussion, Ash. > >>>> > >>>> I would prefer option 2 here with proper tooling to handle the code > >>>> duplication at *release* time. > >>>> It is best to have a dist that has all it needs in itself. > >>>> > >>>> Option 1 could very quickly get out of hand and if we decide to > separate > >>>> triggerer / > >>>> dag processor / config etc etc as separate packages, back compat is > >> going > >>>> to be a nightmare > >>>> and will bite us harder than we anticipate. > >>>> > >>>> Thanks & Regards, > >>>> Amogh Desai > >>>> > >>>> > >>>> On Thu, Jul 3, 2025 at 1:12 AM Kaxil Naik <kaxiln...@gmail.com> > wrote: > >>>> > >>>>> I prefer Option 2 as well to avoid matrix of dependencies > >>>>> > >>>>> On Thu, 3 Jul 2025 at 01:03, Jens Scheffler > <j_scheff...@gmx.de.invalid > >>> > >>>>> wrote: > >>>>> > >>>>>> I'd also rather prefer option 2 - reason here is it is rather > >> pragmatic > >>>>>> and we no not need to cut another package and have less package > counts > >>>>>> and dependencies. > >>>>>> > >>>>>> I remember some time ago I was checking (together with Jarek, I am > not > >>>>>> sure anymore...) if the usage of symlinks would be possible. To keep > >>>> the > >>>>>> source in one package but "symlink" it into another. If then at > point > >>>> of > >>>>>> packaging/release the files are materialized we have 1 set of code. > >>>>>> > >>>>>> Otherwise if not possible still the redundancy could be solved by a > >>>>>> pre-commit hook - and in Git the files are de-duplicated anyway > based > >>>> on > >>>>>> content hash, so this does not hurt. > >>>>>> > >>>>>> On 02.07.25 18:49, Shahar Epstein wrote: > >>>>>>> I support option 2 with proper automation & CI - the reasonings > >>>> you've > >>>>>>> shown for that make sense to me. > >>>>>>> > >>>>>>> > >>>>>>> Shahar > >>>>>>> > >>>>>>> > >>>>>>> On Wed, Jul 2, 2025 at 3:36 PM Ash Berlin-Taylor <a...@apache.org> > >>>>> wrote: > >>>>>>> > >>>>>>>> Hello everyone, > >>>>>>>> > >>>>>>>> As we work on finishing off the code-level separation of Task SDK > >>>> and > >>>>>> Core > >>>>>>>> (scheduler etc) we have come across some situations where we would > >>>>> like > >>>>>> to > >>>>>>>> share code between these. > >>>>>>>> > >>>>>>>> However it’s not as straight forward of “just put it in a common > >>>> dist > >>>>>> they > >>>>>>>> both depend upon” because one of the goals of the Task SDK > >>>> separation > >>>>>> was > >>>>>>>> to have 100% complete version independence between the two, > ideally > >>>>>> even if > >>>>>>>> they are built into the same image and venv. Most of the reason > why > >>>>> this > >>>>>>>> isn’t straight forward comes down to backwards compatibility - if > we > >>>>>> make > >>>>>>>> an change to the common/shared distribution > >>>>>>>> > >>>>>>>> > >>>>>>>> We’ve listed the options we have thought about in > >>>>>>>> https://github.com/apache/airflow/issues/51545 (but that covers > >>>> some > >>>>>> more > >>>>>>>> things that I don’t want to get in to in this discussion such as > >>>>>> possibly > >>>>>>>> separating operators and executors out of a single provider dist.) > >>>>>>>> > >>>>>>>> To give a concrete example of some code I would like to share > >>>>>>>> > >>>>>> > >>>>> > >>>> > >> > https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py > >>>>>>>> — logging config. Another thing we will want to share will be the > >>>>>>>> AirflowConfigParser class from airflow.configuration (but notably: > >>>>> only > >>>>>> the > >>>>>>>> parser class, _not_ the default config values, again, lets not > dwell > >>>>> on > >>>>>> the > >>>>>>>> specifics of that) > >>>>>>>> > >>>>>>>> So to bring the options listed in the issue here for discussion, > >>>>> broadly > >>>>>>>> speaking there are two high-level approaches: > >>>>>>>> > >>>>>>>> 1. A single shared distribution > >>>>>>>> 2. No shared package and copy/duplicate code > >>>>>>>> > >>>>>>>> The advantage of Approach 1 is that we only have the code in one > >>>>> place. > >>>>>>>> However for me, at least in this specific case of Logging config > or > >>>>>>>> AirflowConfigParser class is that backwards compatibility is much > >>>> much > >>>>>>>> harder. > >>>>>>>> > >>>>>>>> The main advantage of Approach 2 is the the code is released > >>>>>> with/embedded > >>>>>>>> in the dist (i.e. apache-airflow-task-sdk would contain the right > >>>>>> version > >>>>>>>> of the logging config and ConfigParser etc). The downside is that > >>>>> either > >>>>>>>> the code will need to be duplicated in the repo, or better yet it > >>>>> would > >>>>>>>> live in a single place in the repo, but some tooling (TBD) will > >>>>>>>> automatically handle the duplication, either at commit time, or my > >>>>>>>> preference, at release time. > >>>>>>>> > >>>>>>>> For this kind of shared “utility” code I am very strongly leaning > >>>>>> towards > >>>>>>>> option 2 with automation, as otherwise I think the backwards > >>>>>> compatibility > >>>>>>>> requirements would make it unworkable (very quickly over time the > >>>>>>>> combinations we would have to test would just be unreasonable) > and I > >>>>>> don’t > >>>>>>>> feel confident we can have things as stable as we need to really > >>>>> deliver > >>>>>>>> the version separation/independency I want to delivery with > AIP-72. > >>>>>>>> > >>>>>>>> So unless someone feels very strongly about this, I will come up > >>>> with > >>>>> a > >>>>>>>> draft PR for further discussion that will implement code sharing > via > >>>>>>>> “vendoring” it at build time. I have an idea of how I can achieve > >>>> this > >>>>>> so > >>>>>>>> we have a single version in the repo and it’ll work there, but at > >>>>>> runtime > >>>>>>>> we vendor it in to the shipped dist so it lives at something like > >>>>>>>> `airflow.sdk._vendor` etc. > >>>>>>>> > >>>>>>>> In terms of repo layout, this likely means we would end up with: > >>>>>>>> > >>>>>>>> airflow-core/pyproject.toml > >>>>>>>> airflow-core/src/ > >>>>>>>> airflow-core/tests/ > >>>>>>>> task-sdk/pyproject.toml > >>>>>>>> task-sdk/src/ > >>>>>>>> task-sdk/tests/ > >>>>>>>> airflow-common/src > >>>>>>>> airflow-common/tests/ > >>>>>>>> # Possibly no airflow-common/pyproject.toml, as deps would be > >>>> included > >>>>>> in > >>>>>>>> the downstream projects. TBD. > >>>>>>>> > >>>>>>>> Thoughts and feedback welcomed. > >>>>>> > >>>>>> > --------------------------------------------------------------------- > >>>>>> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org > >>>>>> For additional commands, e-mail: dev-h...@airflow.apache.org > >>>>>> > >>>>>> > >>>>> > >>>> > >> > >> > >