Hi Stefan and all, We've updated the ADRs [1], which now include: - All the points Jens raised (lifecycle, protocol, etc.) - All the points Stefan raised (forward-compatibility, backward-compatibility for the language runtime msgpack schema, etc.)
Additionally, we've settled on placing the Coordinator subclasses in `sdk.coordinators.java` instead of overloading the Provider (which should only expose features intended for Dag authors). See [2] for details. Thanks to Aritra, Ash, and Elad for raising these concerns and helping us settle on the right direction. [1] https://github.com/apache/airflow/pull/65956/changes#diff-0775f49c0e12a89d7dd25dd8908ea932bd063d6b29838786272cf8cdc7f8ed3b [2] https://github.com/apache/airflow/pull/65958#issuecomment-4386290759 Best, Jason On Wed, May 6, 2026 at 10:28 AM Tzu-ping Chung via dev < [email protected]> wrote: > Thanks Stefan, all great points. > > Since we are likely going to have more language SDKs proposed after the > coordinator layer is defined, I plan to have the comm protocol between the > task runner and the child process more formally defined. This will likely > be a separate AIP after the Java SDK and coordinator go in. > > Before that, let’s update the ADR to describe what the Java side expects > from the task runner (Python). > > TP > > > > On 5 May 2026, at 05:26, Stefan Wang <[email protected]> wrote: > > > > Thanks Jason, > > > > +1 from my side. > > > > I also spent some more time and checked the feature/java-sdk branch and > have a couple of suggestions to firm up the > > forward-compat story before more language SDKs land. > > > > > > What's already there is already good: TaskSdkFrames.kt configures the > Jackson mapper > > > with FAIL_ON_UNKNOWN_PROPERTIES=false, so additive Core changes > > > > (adding a field to StartupDetails, e.g.) are absorbed silently on the > > > > Java side. Python side via msgspec/Pydantic is forward-compat by > > > > default. The codec-level defense is in place. > > > > > > What I'd suggest documenting: > > > > > > 1. Promote that codec setting to an explicit contract in ADR 0003. > > > > Right now it's a one-line implementation detail. 
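The lenient-codec contract in Stefan's point 1 can be made concrete with a small, stdlib-only Python sketch of a decoder that ignores unknown fields (the class and field names below are illustrative stand-ins, not the actual Comms schema; the real implementations rely on Jackson and msgspec/Pydantic):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class StartupDetails:
    # Hypothetical subset of the real StartupDetails message.
    ti_id: str
    dag_rel_path: str
    requests_fd: int = -1  # optional; may be absent in older frames

def lenient_decode(cls, payload):
    """Drop unknown keys before constructing the message, the analogue of
    configuring Jackson with FAIL_ON_UNKNOWN_PROPERTIES=false: additive
    Core-side changes are absorbed instead of raising."""
    raw = json.loads(payload)
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})

# A newer Core sends a field this SDK build does not know about yet.
frame = '{"ti_id": "ti-1", "dag_rel_path": "dags.jar", "brand_new_field": 42}'
details = lenient_decode(StartupDetails, frame)
```

The same frame fed to a strict decoder is exactly the trap described below: one additive server-side field breaks every consumer until they upgrade.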
A future SDK > > author (Rust, additional Go work, future contributor swapping the > > > > mapper) could miss it. Worth stating the contract clearly: > > > > "the Coordinator IPC schema is forward-compatible. SDKs MUST > > > > configure their decoder to ignore unknown fields. Adding fields > > > > to existing messages is non-breaking; renames and type changes > > > > are breaking." > > > > I'm raising this because I've watched the analogous trap fire in > > > > a Python-server / generated-Java-client setup: the Java codegen > > tool there emits its own pre-Jackson whitelist check > > > > (validateJsonElement throwing IllegalArgumentException on unknown > > > > fields, regardless of how the downstream Jackson mapper is > > > > configured). One additive field change in the server response > > > > broke every consumer until clients re-generated. The "lenient > > > > codec" defense only works if every binding uses a lenient codec > > > > — and that's a property worth stating explicitly so future SDK > > > > authors know what they're committing to. > > > > > > 2. The harder cases are worth a sentence too. FAIL_ON_UNKNOWN=false > > > > covers additions, but in Comms.kt several StartupDetails fields > > are `lateinit var` — if Core removes one of those, deserialize > > > > succeeds silently and the first task-side access throws > > > > UninitializedPropertyAccessException at runtime. A sentence in > > > > the ADR distinguishing "additive (non-breaking)" from "rename / > > > > type-change / required-field removal (breaking, deprecation > > > > cycle required)" gives SDK maintainers a clear rule. > > > > > > 3. Worth considering a contract test that simulates Core-ahead-of- > > > > SDK on the IPC layer specifically. SerializationCompatibilityTest > > covers DAG JSON well; CommsTest exercises the current protocol. 
> > > > A small fixture-based test > > > > that feeds the Java decoder a frame > > > > with an unknown field / missing optional field / null in an > > > > optional position would catch codec-config regressions before > > > > they hit users. > > > > > > None of this blocks the AIP — codec config + tests can land in a > > > > follow-up. Mostly want to lock the contract in writing before more > > language SDKs make assumptions that diverge. > > > > > > > > Thank you for driving this! Appreciate the effort. > > > Stefan > > > > > >> On May 4, 2026, at 2:33 AM, Aritra Basu <[email protected]> > wrote: > >> > >> Hey Jason, > >> > >> I'd forgotten to comment on it; I'd taken a look at the proposal and it > >> looks good to me! Looking forward to seeing it go in! Was looking at the > pure > >> Java Dags code and it looked good. Will be taking a deeper look into it > >> soon! > >> > >> -- > >> Regards, > >> Aritra Basu > >> > >> On Mon, 4 May 2026, 1:30 pm Zhe-You(Jason) Liu, <[email protected]> > wrote: > >> > >>> Hi everyone, > >>> > >>> Hoping to get a few more eyes on the open "AIP-108: Java Task SDK and > the > >>> Language Coordinator Layer" and ADRs before we move forward. > >>> Please take a look and share any thoughts when you have a moment. > >>> > >>> - AIP: > >>> > >>> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-108+Java+Task+SDK+and+the+Language+Coordinator+Layer > >>> - ADRs: > >>> > >>> > https://github.com/apache/airflow/pull/65956/changes/876179ab55d3b31486ee52f4c27abd4e215b0fd0 > >>> - PR for adding Coordinator: > https://github.com/apache/airflow/pull/65958 > >>> - PR for adding the Java SDK: https://github.com/apache/airflow/pull/65956 > >>> > >>> Even a short comment or a +1 on the AIP helps us understand where the > community > >>> stands. > >>> Thank you! > >>> > >>> Best, > >>> Jason > >>> > >>> On Tue, Apr 28, 2026 at 10:58 PM Jarek Potiuk <[email protected]> > wrote: > >>> > >>>> Oh super - nice.
I am definitely going to take a look (at both) :D > >>>> > >>>> On Tue, Apr 28, 2026 at 9:55 AM Tzu-ping Chung via dev < > >>>> [email protected]> wrote: > >>>> > >>>>> Hi all, > >>>>> > >>>>> I’ve merged the ADRs and the previous LC message (in the other > thread) > >>>>> into a document formatted as an AIP: > >>>>> > >>>>> AIP-108 Java Task SDK and the Language Coordinator Layer - Airflow - > >>>>> Apache Software Foundation <https://cwiki.apache.org/confluence/x/pY4mGQ> > >>>>> > >>>>> Various sections are taken mostly directly from the other documents, so > >>>>> you can probably glance over them if you’ve already read them. If you > >>>>> haven’t, the AIP document would be a good place to start since it > >>> removes > >>>>> some more detailed descriptions and focuses on the high level > >>> interfaces. > >>>>> More details are still available in the ADR documents included in the > >>> PRs > >>>>> Jason opened. > >>>>> > >>>>> TP > >>>>> > >>>>> > >>>>> On 28 Apr 2026, at 11:12, Zhe-You(Jason) Liu <[email protected]> > >>> wrote: > >>>>> > >>>>> Hi Jens, > >>>>> > >>>>> The ADRs are now available here [1], hope that helps clarify some of > >>> your > >>>>> concerns. > >>>>> > >>>>> As I understood Java is static compiled JAR files, no on-the-fly compile > >>>>> > >>>>> from Java source tree (correct?) > >>>>> > >>>>> That's correct -- the user will compile the code themselves and set the `[java] > >>>>> bundles_folder` config to point to the directory of those JAR files. > >>>>> > >>>>> so actually the Dag parsing concept then is quite "static" and once > >>>>> > >>>>> generated actually no need to re-parse the Dag in Java mode? > >>>>> > >>>>> Still if static JAR deployed how does the deploy lifecycle look like?
> >>>>> > >>>>> Would you need to restart with new deploy the Dag Parser and > respective > >>>>> workers? > >>>>> > >>>>> The lifecycle for dag-parsing will be the same as how the current > >>>>> `DagFileProcessorProcess` acts. The coordinator comes into play > before > >>> we > >>>>> start the actual parse file entrypoint [2]. Regardless of what > language > >>>>> the > >>>>> parse file subprocess is implemented in, as long as it returns a > valid > >>>>> serialized Dag JSON in msgpack over IPC, the behavior will remain the > >>>>> same. > >>>>> So there is no need to restart the `airflow dag-processor`, and the > >>>>> `airflow worker` will not be involved in dag processing at all. > >>>>> > >>>>> Is it really realistic that a "LocalExecutor" needs to be supported > or > >>>>> > >>>>> can we limit it to e.g. only remote executors to reduce coupling and > >>>>> complexity of operating the core? > >>>>> > >>>>> The coordinator is the interface that decides how we want to launch > the > >>>>> subprocess for both dag-processing and workload-execution. This > means it > >>>>> will support **any** executor out of the box, as we integrate the > >>>>> coordinator at the TaskSDK level for workload-execution [3]. > >>>>> > >>>>> What overhead does the "Coordinator layer" generate compared to a > Java > >>>>> > >>>>> specific supervisor implementation? > >>>>> > >>>>> The "Coordinator layer" is the interface for Airflow-Core to interact > >>> with > >>>>> the target language subprocess. We still need a Java-specific > supervisor > >>>>> implementation, which is the first PR I mentioned in another thread > -- > >>>>> Java > >>>>> SDK [4]. > >>>>> > >>>>> Is it not only a new / additional process but also IPC involved then. > >>> And > >>>>> > >>>>> at least I saw also performance problems e.g.
using very large XComs > >>> where > >>>>> even heartbeats are lost due to long running IPC > >>>>> > >>>>> I would consider this out of scope of the "Java SDK and the > Coordinator > >>>>> Layer" AIP, as the current Python-native TaskSDK supervisor will > >>> encounter > >>>>> the same issue. The multi-language support here follows the same > >>> protocol > >>>>> that the current TaskSDK uses. > >>>>> > >>>>> [1] > >>>>> > >>>>> > >>> > https://github.com/apache/airflow/pull/65956/changes/876179ab55d3b31486ee52f4c27abd4e215b0fd0 > >>>>> [2] > >>>>> > >>>>> > >>> > https://github.com/apache/airflow/pull/65958/changes#diff-564fd0a8fbe4cc47864a8043fcc1389b33120c88bb35852b26f45c36b902f70bR540-R573 > >>>>> [3] > >>>>> > >>>>> > >>> > https://github.com/apache/airflow/pull/65958/changes#diff-5bef10ab2956abf7360dbf9b509b6e1113407874d24abcc1b276475051f13abfR1991-R2001 > >>>>> [4] https://github.com/apache/airflow/pull/65956 > >>>>> > >>>>> Thanks. > >>>>> > >>>>> Best, > >>>>> Jason > >>>>> > >>>>> On Tue, Apr 28, 2026 at 3:03 AM Jens Scheffler <[email protected]> > >>>>> wrote: > >>>>> > >>>>> Thanks TP for raising this! > >>>>> > >>>>> I would need to sleep on the block of information in the described > >>>>> details and might have some detail questions just to ensure I understood > >>>>> right. So an ADR or AIP document might be a better place to comment than an > >>>>> email thread. > >>>>> > >>>>> Things that jump into my head but there would be more coming thinking > >>>>> about it: > >>>>> > >>>>> * As I understood Java is static compiled JAR files, no on-the-fly > >>>>> compile from Java source tree (correct?) - in this case also the > >>>>> Dags are "static until re-deploy" - so actually the Dag parsing > >>>>> concept then is quite "static" and once generated actually no need > >>>>> to re-parse the Dag in Java mode? > >>>>> * Still if static JAR deployed how does the deploy lifecycle look > >>>>> like?
Would you need to restart with new deploy the Dag Parser and > >>>>> respective workers? > >>>>> * Is it really realistic that a "LocalExecutor" needs to be supported > >>>>> or can we limit it to e.g. only remote executors to reduce coupling > >>>>> and complexity of operating the core? > >>>>> * What overhead does the "Coordinator layer" generate compared to a > >>>>> Java specific supervisor implementation? Is it not only a new / > >>>>> additional process but also IPC involved then. And at least I saw > >>>>> also performance problems e.g. using very large XComs where even > >>>>> heartbeats are lost due to long running IPC > >>>>> (https://github.com/apache/airflow/issues/64628) > >>>>> * (There might be more coming :-) ) > >>>>> > >>>>> Jens > >>>>> > >>>>> P.S.: Questions here also do not mean rejection; I'd just like to understand > >>>>> what complexity and overhead we are adding with all this. > >>>>> > >>>>> On 27.04.26 15:21, Jarek Potiuk wrote: > >>>>> > >>>>> Also I would like to point out one thing. > >>>>> > >>>>> This should not be `LAZY CONSENSUS` just yet; this is quite a big thing > >>>>> > >>>>> to > >>>>> > >>>>> discuss. I missed the subject already had it. > >>>>> > >>>>> At this stage this is really a discussion (I renamed the thread). Because > >>>>> ... we have never discussed it before. > >>>>> > >>>>> LAZY CONSENSUS should be really called for after initial discussion (on > >>>>> devlist) points to us actually reaching the consensus. > >>>>> > >>>>> While this one is unlikely to cause much controversy (I think), > >>>>> > >>>>> sufficient > >>>>> > >>>>> time for people to discuss and digest it before calling for lazy > >>>>> > >>>>> consensus > >>>>> > >>>>> is a necessary prerequisite. We simply need time to build consensus. > >>>>> > >>>>> We have a few processes where we build a "general" consensus first, and > >>>>> then we apply it only to particular cases (such as new providers).
> But > >>> in > >>>>> most cases when we have a "big" thing to discuss, we need to build > >>>>> consensus on devlist first. While in some cases people discussed things > >>>>> off-list and came to some conclusions (which is perfectly fine) - > >>> bringing > >>>>> it to the list as a consensus, where we do not know if we achieved it > >>>>> > >>>>> yet, > >>>>> > >>>>> is - I think - a bit premature. > >>>>> > >>>>> See the lazy consensus explanation [1] and "consensus building" [2] > >>>>> > >>>>> [1] Lazy consensus - > >>>>> > >>>>> > >>> > https://community.apache.org/committers/decisionMaking.html#lazy-consensus > >>>>> > >>>>> [2] Consensus building - > >>>>> > >>>>> > >>>>> > >>> > https://community.apache.org/committers/decisionMaking.html#consensus-building > >>>>> > >>>>> > >>>>> J. > >>>>> > >>>>> On Mon, Apr 27, 2026 at 1:27 PM Aritra Basu <[email protected]> > >>>>> wrote: > >>>>> > >>>>> Hey TP, > >>>>> > >>>>> Overall +1. This is quite an interesting implementation. A couple of > >>>>> questions: is the provider the right place for the coordinator? Don't have > >>>>> strong opinions or alternatives, but I am curious. > >>>>> > >>>>> Also, for the parser, I wanted to understand a bit better how it works; I > >>>>> > >>>>> tried > >>>>> > >>>>> going through the SDK but wasn't able to fully understand it. Also +1 to > >>>>> Jarek's recommendation for documentation. > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Regards, > >>>>> Aritra Basu > >>>>> > >>>>> On Mon, 27 Apr 2026, 11:39 am Tzu-ping Chung via dev, < > >>>>> [email protected]> wrote: > >>>>> > >>>>> Hi all, > >>>>> > >>>>> As mentioned in the latest dev call, we have been developing a Java SDK > >>>>> with changes to Airflow in a separate fork[1]. We plan to start merging > >>>>> > >>>>> the > >>>>> > >>>>> Java SDK work back into the OSS repository.
> >>>>> > >>>>> We see this as a natural step following initial work in AIP-72[2], > >>>>> > >>>>> which > >>>>> > >>>>> created “a clean language agnostic interface for task execution, with > >>>>> support for multiple language bindings” (quoted from the proposal). > >>>>> > >>>>> The Java SDK also uses Ash’s addition of @task.stub[3] for the Go > SDK, > >>>>> > >>>>> to > >>>>> > >>>>> declare a task in a DAG to be “implemented elsewhere” (not in the > >>>>> > >>>>> annotated > >>>>> > >>>>> function). Similar to the Go SDK, we also created a Java library that > >>>>> > >>>>> users > >>>>> > >>>>> can use to write task implementations for Airflow to execute at > >>>>> > >>>>> runtime. > >>>>> > >>>>> > >>>>> [1]:https://github.com/astronomer/airflow/tree/feature/java-all > >>>>> [2]:https://cwiki.apache.org/confluence/x/xgmTEg > >>>>> [3]:https://github.com/apache/airflow/pull/56055 > >>>>> > >>>>> The user-facing syntax for a stub task would be the same as implemented > >>>>> > >>>>> by > >>>>> > >>>>> the Go SDK: > >>>>> > >>>>> @task.stub(queue="java-tasks") > >>>>> def my_task(): ... > >>>>> > >>>>> With a new configuration option to map tasks on a queue to be executed > >>>>> > >>>>> by > >>>>> > >>>>> a > >>>>> > >>>>> specific SDK: > >>>>> > >>>>> [sdk] > >>>>> queue_to_sdk = {"java-tasks": "java"} > >>>>> > >>>>> The configuration is needed for some executors the Go SDK currently > >>>>> > >>>>> does > >>>>> > >>>>> not support. The Go SDK currently relies on each executor worker > >>>>> > >>>>> process > >>>>> > >>>>> to > >>>>> > >>>>> specify which queues they listen to, but this is not always viable, > >>>>> > >>>>> since > >>>>> > >>>>> some executors—LocalExecutor, for example—do not have the concept of > >>>>> > >>>>> worker > >>>>> > >>>>> processes.
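As an illustration of how the proposed "queue_to_sdk" option could be consumed at dispatch time, here is a minimal sketch (the function name and the fallback to "python" are assumptions for illustration, not the actual Airflow implementation):

```python
# Hypothetical sketch of the proposed [sdk] queue_to_sdk routing.
QUEUE_TO_SDK = {"java-tasks": "java"}  # as parsed from the airflow.cfg [sdk] section

def sdk_for_queue(queue: str) -> str:
    # Queues without an explicit mapping fall back to the default Python SDK,
    # so existing deployments keep working without any configuration change.
    return QUEUE_TO_SDK.get(queue, "python")
```

A task declared with @task.stub(queue="java-tasks") would thus be routed to the Java SDK, while tasks on any unmapped queue keep running under the Python SDK.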
> >>>>> > >>>>> The Coordinator Layer > >>>>> ===================== > >>>>> > >>>>> When the Go SDK was implemented, it left out Runtime Airflow plugins > >>>>> > >>>>> as a > >>>>> > >>>>> future topic. This includes custom XCom backends, secrets backends > >>>>> > >>>>> lookup > >>>>> > >>>>> for connections and variables, etc. These components are implemented > in > >>>>> Python, and a Java task cannot easily use the feature unless we also > >>>>> implement the lookup logic in Java. We don’t want to do that since it > >>>>> introduces significant overhead to writing plugins, and the overhead > >>>>> multiplies with each new language SDK. > >>>>> > >>>>> Fortunately, the current execution-time task runner already uses a > >>>>> two-layer design. When an executor wants to run a task, it starts a > >>>>> (Python) task runner process that talks to Airflow Core through the > >>>>> Execution API, and *forks* another (Python) process, which talks to > the > >>>>> task runner through TCP, to run the actual task code. Airflow plugins > >>>>> simply go into the task runner process. > >>>>> > >>>>> This design works well for us since it keeps all the Airflow plugins > in > >>>>> Python. The only thing missing is an abstraction for the task runner > >>>>> process to run tasks in any language. We are calling this new layer > the > >>>>> **Coordinator**. > >>>>> > >>>>> When a DAG bundle is loaded, it not only tells Airflow how to find > the > >>>>> DAGs (and the tasks in them), but also how to *run* each task. > Current > >>>>> Python tasks use the Python Coordinator, running tasks by forking as > >>>>> previously described. A new JVM Coordinator will instruct the task > >>>>> > >>>>> runner > >>>>> > >>>>> how to run tasks packaged in JAR files. 
> >>>>> > >>>>> Each coordinator implements a base interface (BaseRuntimeCoordinator) > >>>>> > >>>>> that > >>>>> > >>>>> handles three concerns: > >>>>> > >>>>> - Discovery: determining whether a given file belongs to this > >>>>> > >>>>> coordinator > >>>>> > >>>>> (e.g. JAR files for Java). > >>>>> - DAG parsing: returning a runtime-specific subprocess command to > parse > >>>>> DAG files in the target language. > >>>>> - Task execution: returning a runtime-specific subprocess command to > >>>>> execute tasks in the target runtime. > >>>>> > >>>>> The base class owns the full bridge lifecycle—TCP servers, subprocess > >>>>> management, and cleanup—so language providers only need to implement > >>>>> > >>>>> these > >>>>> > >>>>> three methods. > >>>>> > >>>>> The coordinator translates a DagFileParseRequest (for DAG parsing) > and > >>>>> StartupDetails (for Task execution) data model (as declared in > Airflow) > >>>>> into the appropriate commands for the target runtime. For example, a > >>>>> > >>>>> “java > >>>>> > >>>>> -classpath ... /path/to/MainClass ...” subprocess command that points > >>>>> > >>>>> to > >>>>> > >>>>> the correct JAR file and main class in this case. > >>>>> > >>>>> Coordinators as Airflow Providers > >>>>> ================================= > >>>>> > >>>>> The base coordinator interface and the Python coordinator will live > in > >>>>> “airflow.sdk.execution_time”. Other coordinators (for foreign > >>>>> > >>>>> languages) > >>>>> > >>>>> are registered through the existing Airflow provider mechanism. Each > >>>>> > >>>>> SDK > >>>>> > >>>>> provider declares its coordinator in its provider.yaml under a > >>>>> “coordinators” extension point. Both ProvidersManager (airflow-core) > >>>>> > >>>>> and > >>>>> > >>>>> ProvidersManagerTaskRuntime (task-sdk) discover coordinators through > >>>>> > >>>>> this > >>>>> > >>>>> extension point. 
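The three concerns handled by the base interface could be sketched roughly as follows (method names, signatures, and command strings are assumptions based on this description, not the actual BaseRuntimeCoordinator API):

```python
from abc import ABC, abstractmethod

class BaseRuntimeCoordinator(ABC):
    """Sketch of the coordinator interface: the base class would own the
    bridge lifecycle, while language providers fill in three methods."""

    @abstractmethod
    def owns_file(self, path: str) -> bool:
        """Discovery: does this file belong to this coordinator?"""

    @abstractmethod
    def parse_command(self, path: str) -> list[str]:
        """Subprocess command that parses DAG files in the target language."""

    @abstractmethod
    def execute_command(self, jar: str, main_class: str) -> list[str]:
        """Subprocess command that executes a task in the target runtime."""

class JavaCoordinator(BaseRuntimeCoordinator):
    """Illustrative JVM coordinator; the class names are placeholders."""

    def owns_file(self, path: str) -> bool:
        return path.endswith(".jar")

    def parse_command(self, path: str) -> list[str]:
        return ["java", "-classpath", path, "com.example.DagParserMain"]

    def execute_command(self, jar: str, main_class: str) -> list[str]:
        return ["java", "-classpath", jar, main_class]
```

The returned argument lists map directly to the "java -classpath ..." subprocess commands described above; the base class would launch and supervise whatever command the subclass returns.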
This means adding a new language runtime requires > >>>>> > >>>>> only a > >>>>> > >>>>> provider package. No changes to Airflow Core are needed. > >>>>> > >>>>> The new JVM-based coordinator will live in the namespace > >>>>> “airflow.providers.sdk.java”. This is not the most accurate name > >>>>> (technically it should be “jvm” instead), but in practice most users > >>>>> > >>>>> will > >>>>> > >>>>> recognize it, and (from my understanding) other JVM language users > >>>>> > >>>>> (e.g. > >>>>> > >>>>> Kotlin, Scala) are already well-versed enough dealing with Java > >>>>> interoperability to understand “java” means JVM in this context. > >>>>> > >>>>> Writing DAGs in Java > >>>>> ==================== > >>>>> > >>>>> This is not strictly connected to AIP-72, but considered by us as a > >>>>> natural next step since we can now implement tasks in a foreign > >>>>> > >>>>> language. > >>>>> > >>>>> Being able to define the DAG in the same language as the task > >>>>> implementation is useful since writing Python, even if only with > >>>>> > >>>>> minimal > >>>>> > >>>>> syntax, is still a hurdle for those not already familiar with it, or not even > >>>>> allowed to run it. There are mainly three things we need on top of the > >>>>> > >>>>> task > >>>>> > >>>>> implementation interface: > >>>>> > >>>>> - DAG flags (e.g. schedule, max_active_tasks) > >>>>> - Task flags (e.g. trigger_rule, weight_rule) > >>>>> - Task dependencies > >>>>> > >>>>> A proof-of-concept implementation is included with other changes > >>>>> > >>>>> proposed > >>>>> > >>>>> elsewhere in this document. > >>>>> > >>>>> Lazy Consensus Topics > >>>>> ===================== > >>>>> > >>>>> We’re calling for lazy consensus for the following topics: > >>>>> > >>>>> - A new “queue_to_sdk” configuration option to route tasks to a > >>>>> > >>>>> specific > >>>>> > >>>>> language SDK > >>>>> - A new coordinator layer in the SDK to route implementations at > >>>>> > >>>>> execution > >>>>> > >>>>> time.
- New providers under airflow.providers.sdk to provide additional > >>>>> > >>>>> language > >>>>> > >>>>> support. > >>>>> - Develop the Go SDK to support the proposed model and a provider > >>>>> > >>>>> package > >>>>> > >>>>> for the coordinator. (Existing features stay as-is; no breaking > >>>>> > >>>>> changes.) > >>>>> > >>>>> - Add the new Java SDK and the corresponding provider package. > >>>>> > >>>>> TP > >>>>> > >>>>> --------------------------------------------------------------------- > >>>>> To unsubscribe, e-mail: [email protected] > >>>>> For additional commands, e-mail: [email protected]
