Thanks Stefan, all great points. Since more language SDKs are likely to be proposed after the coordinator layer is defined, I plan to define the communication protocol between the task runner and the child process more formally. This will likely be a separate AIP after the Java SDK and coordinator go in.
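To give a flavor of what a formal definition could pin down, here is a toy sketch (Python, stdlib only) of a length-prefixed JSON frame with lenient decoding. The real protocol's framing, encoding (msgpack is mentioned elsewhere in this thread), and field names are all still to be specified; every name below is illustrative only:

```python
# Toy sketch of a forward-compatible frame codec for the
# task-runner <-> child-process channel. NOT the actual protocol:
# framing, field names, and the StartupDetails subset are all
# hypothetical, for illustration of the "lenient decoder" contract.
import io
import json
import struct
from dataclasses import dataclass, fields


@dataclass
class StartupDetails:
    # Illustrative subset; the real model lives in the Task SDK.
    ti_id: str
    dag_rel_path: str


def encode_frame(payload: dict) -> bytes:
    """Length-prefix a JSON payload: 4-byte big-endian length + body."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def decode_frame(stream: io.BytesIO, model):
    """Decode one frame, silently dropping unknown fields so that
    additive changes on the sender side are non-breaking."""
    (length,) = struct.unpack(">I", stream.read(4))
    raw = json.loads(stream.read(length))
    known = {f.name for f in fields(model)}
    return model(**{k: v for k, v in raw.items() if k in known})


# A frame produced by a newer Core with an extra field decodes cleanly:
frame = encode_frame({"ti_id": "abc", "dag_rel_path": "dags/x.jar", "new_field": 1})
details = decode_frame(io.BytesIO(frame), StartupDetails)
```

The point a spec would make explicit is the last part: unknown fields must be ignored by every SDK's decoder, which is exactly the contract Stefan asks to write down below.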
Before that, let’s update the ADR to describe what the Java side expects from the task runner (Python). TP > On 5 May 2026, at 05:26, Stefan Wang <[email protected]> wrote: > > Thanks Jason, > > +1 from my side. > > I also spent some more time and checked the feature/java-sdk branch and have > a couple of suggestions to firm up the > forward-compat story before more language SDKs land. > > > > What's already there is already good: TaskSdkFrames.kt configures the > Jackson mapper > > with FAIL_ON_UNKNOWN_PROPERTIES=false, so additive Core changes > > > (adding a field to StartupDetails, e.g.) are absorbed silently on the > > > Java side. Python side via msgspec/Pydantic is forward-compat by > > > default. The codec-level defense is in place. > > > > What I'd suggest documenting: > > > > 1. Promote that codec setting to an explicit contract in ADR 0003. > > > Right now it's a one-line implementation detail. A future SDK > author (Rust, additional Go work, future contributor swapping the > > > mapper) could miss it. Worth stating the contract clearly: > > > "the Coordinator IPC schema is forward-compatible. SDKs MUST > > > configure their decoder to ignore unknown fields. Adding fields > > > to existing messages is non-breaking; renames and type changes > > > are breaking." > > I'm raising this because I've watched the analogous trap fire in > > > a Python-server / generated-Java-client setup: the Java codegen > tool there emits its own pre-Jackson whitelist check > > > (validateJsonElement throwing IllegalArgumentException on unknown > > > fields, regardless of how the downstream Jackson mapper is > > > configured). One additive field change in the server response > > > broke every consumer until clients re-generated. The "lenient > > > codec" defense only works if every binding uses a lenient codec > > > — and that's a property worth stating explicitly so future SDK > > > authors know what they're committing to. > > > > 2. The harder cases are worth a sentence too. 
FAIL_ON_UNKNOWN=false > > > covers additions, but in Comms.kt several StartupDetails fields > are `lateinit var` — if Core removes one of those, deserialize > > > succeeds silently and the first task-side access throws > > > UninitializedPropertyAccessException at runtime. A sentence in > > > the ADR distinguishing "additive (non-breaking)" from "rename / > > > type-change / required-field removal (breaking, deprecation > > > cycle required)" gives SDK maintainers a clear rule. > > > > 3. Worth considering a contract test that simulates Core-ahead-of- > > > SDK on the IPC layer specifically. SerializationCompatibilityTest > covers DAG JSON well; CommsTest exercises the current protocol. > > > A small fixture-based test that feeds the Java decoder a frame > > > with an unknown field / missing optional field / null in an > > > optional position would catch codec-config regressions before > > > they hit users. > > > > None of this blocks the AIP — codec config + tests can land in a > > > follow-up. Mostly want to lock the contract in writing before more > language SDKs make assumptions that diverge. > > > > > Thank you for driving this! Appreciate the effort. > > Stefan > > >> On May 4, 2026, at 2:33 AM, Aritra Basu <[email protected]> wrote: >> >> Hey Jason, >> >> I'd forgotten to comment on it, I'd taken a look at the proposal and it >> looks good to me! Looking forward to see it go in! Was looking at the pure >> java dags code and it looked good. Will be taking a deeper look into it >> soon! >> >> -- >> Regards, >> Aritra Basu >> >> On Mon, 4 May 2026, 1:30 pm Zhe-You(Jason) Liu, <[email protected]> wrote: >> >>> Hi everyone, >>> >>> Hoping to get a few more eyes on the open "AIP-108: Java Task SDK and the >>> Language Coordinator Layer" and ADRs before we move forward. >>> Please take a look and share any thoughts when you have a moment. 
>>> >>> - AIP: >>> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-108+Java+Task+SDK+and+the+Language+Coordinator+Layer >>> - ADRs: >>> >>> https://github.com/apache/airflow/pull/65956/changes/876179ab55d3b31486ee52f4c27abd4e215b0fd0 >>> - PR for adding Coordinator: https://github.com/apache/airflow/pull/65958 >>> - PR for adding Java-SDK: https://github.com/apache/airflow/pull/65956 >>> >>> Even a short comment or a +1 on AIP helps us understand where the community >>> stands. >>> Thank you! >>> >>> Best, >>> Jason >>> >>> On Tue, Apr 28, 2026 at 10:58 PM Jarek Potiuk <[email protected]> wrote: >>> >>>> Oh super - nice. I am definitely going to take a look (at both) :D >>>> >>>> On Tue, Apr 28, 2026 at 9:55 AM Tzu-ping Chung via dev < >>>> [email protected]> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I’ve merged the ADRs and the previous LC message (in the other thread) >>>>> into a document formatted as an AIP: >>>>> >>>>> AIP-108 Java Task SDK and the Language Coordinator Layer - Airflow - >>>>> Apache Software Foundation: <https://cwiki.apache.org/confluence/x/pY4mGQ> >>>>> >>>>> Various sections are taken mostly directly from the other documents, so >>>>> you can probably glance over them if you’ve already read them. If you >>>>> haven’t, the AIP document would be a good place to start since it >>> removes >>>>> some more detailed descriptions and focuses on the high level >>> interfaces. >>>>> More details are still available in the ADR documents included in the >>> PRs >>>>> Jason opened. >>>>> >>>>> TP >>>>> >>>>> >>>>> On 28 Apr 2026, at 11:12, Zhe-You(Jason) Liu <[email protected]> >>> wrote: >>>>> >>>>> Hi Jens, >>>>> >>>>> The ADRs are now available here [1], hope that helps clarify some of >>> your >>>>> concerns.
>>>>> >>>>> As I understood Java is static compiled JAR files, no on-the-fly compile >>>>> >>>>> from Java source tree (correct?) >>>>> >>>>> That's correct -- the user will compile the JARs themselves and set the `[java] >>>>> bundles_folder` config to point to the directory of those JAR files. >>>>> >>>>> so actually the Dag parsing concept then is quite "static" and once >>>>> >>>>> generated actually no need to re-parse the Dag in Java mode? >>>>> >>>>> Still if static JAR deployed how does the deploy lifecycle look like? >>>>> >>>>> Would you need to restart with new deploy the Dag Parser and respective >>>>> workers? >>>>> >>>>> The lifecycle for dag-parsing will be the same as how the current >>>>> `DagFileProcessorProcess` acts. The coordinator comes into play before >>> we >>>>> start the actual parse file entrypoint [2]. Regardless of what language >>>>> the >>>>> parse file subprocess is implemented in, as long as it returns a valid >>>>> serialized Dag JSON in msgpack over IPC, the behavior will remain the >>>>> same. >>>>> So there is no need to restart the `airflow dag-processor`, and the >>>>> `airflow worker` will not be involved in dag processing at all. >>>>> >>>>> Is it really realistic that a "LocalExecutor" needs to be supported or >>>>> >>>>> can we limit it to e.g. only remote executors to reduce coupling and >>>>> complexity of operating the core? >>>>> >>>>> The coordinator is the interface that decides how we want to launch the >>>>> subprocess for both dag-processing and workload-execution. This means it >>>>> will support **any** executor out of the box, as we integrate the >>>>> coordinator at the TaskSDK level for workload-execution [3]. >>>>> >>>>> What overhead does the "Coordinator layer" generate compared to a Java >>>>> >>>>> specific supervisor implementation? >>>>> >>>>> The "Coordinator layer" is the interface for Airflow-Core to interact >>> with >>>>> the target language subprocess.
We still need a Java-specific supervisor >>>>> implementation, which is the first PR I mentioned in another thread -- >>>>> Java >>>>> SDK [4]. >>>>> >>>>> Is it not only a new / additional process but also IPC involved then. >>> And >>>>> >>>>> at least I saw also performance problems e.g. using very large XComs >>> where >>>>> even heartbeats are lost due to long running IPC >>>>> >>>>> I would consider this out of scope of the "Java SDK and the Coordinator >>>>> Layer" AIP, as the current Python-native TaskSDK supervisor will >>> encounter >>>>> the same issue. The multi-language support here follows the same >>> protocol >>>>> that the current TaskSDK uses. >>>>> >>>>> [1] >>>>> >>>>> >>> https://github.com/apache/airflow/pull/65956/changes/876179ab55d3b31486ee52f4c27abd4e215b0fd0 >>>>> [2] >>>>> >>>>> >>> https://github.com/apache/airflow/pull/65958/changes#diff-564fd0a8fbe4cc47864a8043fcc1389b33120c88bb35852b26f45c36b902f70bR540-R573 >>>>> [3] >>>>> >>>>> >>> https://github.com/apache/airflow/pull/65958/changes#diff-5bef10ab2956abf7360dbf9b509b6e1113407874d24abcc1b276475051f13abfR1991-R2001 >>>>> [4] https://github.com/apache/airflow/pull/65956 >>>>> >>>>> Thanks. >>>>> >>>>> Best, >>>>> Jason >>>>> >>>>> On Tue, Apr 28, 2026 at 3:03 AM Jens Scheffler <[email protected]> >>>>> wrote: >>>>> >>>>> Thanks TP for raising this! >>>>> >>>>> I would need to sleep over the block of information in the described >>>>> details and might have some detail questions just to ensure I understood >>>>> right. So an ADR or AIP document might be a better place to comment than an >>>>> email thread. >>>>> >>>>> Things that jump into my head but there would be more coming thinking >>>>> about it: >>>>> >>>>> * As I understood Java is static compiled JAR files, no on-the-fly >>>>> compile from Java source tree (correct?)
- in this case also the >>>>> Dags are "static until re-deploy" - so actually the Dag parsing >>>>> concept then is quite "static" and once generated actually no need >>>>> to re-parse the Dag in Java mode? >>>>> * Still if static JAR deployed how does the deploy lifecycle look >>>>> like? Would you need to restart with new deploy the Dag Parser and >>>>> respective workers? >>>>> * Is it really realistic that a "LocalExecutor" needs to be supported >>>>> or can we limit it to e.g. only remote executors to reduce coupling >>>>> and complexity of operating the core? >>>>> * What overhead does the "Coordinator layer" generate compared to a >>>>> Java specific supervisor implementation? Is it not only a new / >>>>> additional process but also IPC involved then. And at least I saw >>>>> also performance problems e.g. using very large XComs where even >>>>> heartbeats are lost due to long running IPC >>>>> (https://github.com/apache/airflow/issues/64628) >>>>> * (There might be more coming :-) ) >>>>> >>>>> Jens >>>>> >>>>> P.S.: Questions here also do not mean rejection but like to understand >>>>> which complexity and overhead we have adding all this. >>>>> >>>>> On 27.04.26 15:21, Jarek Potiuk wrote: >>>>> >>>>> Also I would like to point out one thing. >>>>> >>>>> This should not be `LAZY CONSENSUS` just yet, this is quite a big thing >>>>> >>>>> to >>>>> >>>>> discuss. I missed that the subject already had it. >>>>> >>>>> At this stage this is really a discussion (I renamed the thread.) Because >>>>> ... we have never discussed it before. >>>>> >>>>> LAZY CONSENSUS should really be called for after initial discussion (on >>>>> devlist) points to us actually reaching the consensus. >>>>> >>>>> While this one is unlikely to cause much controversy (I think), >>>>> >>>>> sufficient >>>>> >>>>> time for people to discuss and digest it before calling for lazy >>>>> >>>>> consensus >>>>> >>>>> is a necessary prerequisite. We simply need time to build consensus.
>>>>> >>>>> We have a few processes where we build a "general" consensus first, and >>>>> then we apply it only to particular cases (such as new providers). But >>> in >>>>> most cases when we have a "big" thing to discuss, we need to build >>>>> consensus on devlist first. While in some cases people discussed things >>>>> off-list and came to some conclusions (which is perfectly fine) - >>> bringing >>>>> it to the list as a consensus, where we do not know if we achieved it >>>>> >>>>> yet, >>>>> >>>>> is - I think - a bit premature. >>>>> >>>>> See the lazy consensus explanation [1] and "consensus building" [2] >>>>> >>>>> [1] Lazy consensus - >>>>> >>>>> >>> https://community.apache.org/committers/decisionMaking.html#lazy-consensus >>>>> >>>>> [2] Consensus building - >>>>> >>>>> >>>>> >>> https://community.apache.org/committers/decisionMaking.html#consensus-building >>>>> >>>>> >>>>> J. >>>>> >>>>> On Mon, Apr 27, 2026 at 1:27 PM Aritra Basu <[email protected]> >>>>> wrote: >>>>> >>>>> Hey TP, >>>>> >>>>> Overall +1, this is quite an interesting implementation. A couple of >>>>> questions: is a provider the right place for the coordinator? Don't have >>>>> strong opinions or alternatives, but I am curious. >>>>> >>>>> Also for the parser, I wanted to understand a bit better how it works. I >>>>> >>>>> tried >>>>> >>>>> going through the SDK but wasn't able to fully understand it. Also +1 to >>>>> Jarek's recommendation for documentation. >>>>> >>>>> >>>>> >>>>> -- >>>>> Regards, >>>>> Aritra Basu >>>>> >>>>> On Mon, 27 Apr 2026, 11:39 am Tzu-ping Chung via dev, < >>>>> [email protected]> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> As mentioned in the latest dev call, we have been developing a Java SDK >>>>> with changes to Airflow in a separate fork[1]. We plan to start merging >>>>> >>>>> the >>>>> >>>>> Java SDK work back into the OSS repository.
>>>>> >>>>> We see this as a natural step following initial work in AIP-72[2], >>>>> >>>>> which >>>>> >>>>> created “a clean language agnostic interface for task execution, with >>>>> support for multiple language bindings” (quoted from the proposal). >>>>> >>>>> The Java SDK also uses Ash’s addition of @task.stub[3] for the Go SDK, >>>>> >>>>> to >>>>> >>>>> declare a task in a DAG to be “implemented elsewhere” (not in the >>>>> >>>>> annotated >>>>> >>>>> function). Similar to the Go SDK, we also created a Java library that >>>>> >>>>> users >>>>> >>>>> can use to write task implementations for Airflow to execute at >>>>> >>>>> runtime. >>>>> >>>>> >>>>> [1]:https://github.com/astronomer/airflow/tree/feature/java-all >>>>> [2]:https://cwiki.apache.org/confluence/x/xgmTEg >>>>> [3]:https://github.com/apache/airflow/pull/56055 >>>>> >>>>> The user-facing syntax for a stub task would be the same as implemented >>>>> >>>>> by >>>>> >>>>> the Go SDK: >>>>> >>>>> @task.stub(queue="java-tasks") >>>>> def my_task(): ... >>>>> >>>>> With a new configuration option to map tasks in a pool to be executed >>>>> >>>>> by >>>>> >>>>> a >>>>> >>>>> specific SDK: >>>>> >>>>> [sdk] >>>>> queue_to_sdk = {"java-tasks": "java"} >>>>> >>>>> The configuration is needed for some executors the Go SDK currently >>>>> >>>>> does >>>>> >>>>> not support. The Go SDK currently relies on each executor worker >>>>> >>>>> process >>>>> >>>>> to >>>>> >>>>> specify which queues they listen to, but this is not always viable, >>>>> >>>>> since >>>>> >>>>> some executors—LocalExecutor, for example—do not have the concept of >>>>> >>>>> worker >>>>> >>>>> processes. >>>>> >>>>> The Coordinator Layer >>>>> ===================== >>>>> >>>>> When the Go SDK was implemented, it left out Runtime Airflow plugins >>>>> >>>>> as a >>>>> >>>>> future topic. This includes custom XCom backends, secrets backends >>>>> >>>>> lookup >>>>> >>>>> for connections and variables, etc. 
These components are implemented in >>>>> Python, and a Java task cannot easily use the feature unless we also >>>>> implement the lookup logic in Java. We don’t want to do that since it >>>>> introduces significant overhead to writing plugins, and the overhead >>>>> multiplies with each new language SDK. >>>>> >>>>> Fortunately, the current execution-time task runner already uses a >>>>> two-layer design. When an executor wants to run a task, it starts a >>>>> (Python) task runner process that talks to Airflow Core through the >>>>> Execution API, and *forks* another (Python) process, which talks to the >>>>> task runner through TCP, to run the actual task code. Airflow plugins >>>>> simply go into the task runner process. >>>>> >>>>> This design works well for us since it keeps all the Airflow plugins in >>>>> Python. The only thing missing is an abstraction for the task runner >>>>> process to run tasks in any language. We are calling this new layer the >>>>> **Coordinator**. >>>>> >>>>> When a DAG bundle is loaded, it not only tells Airflow how to find the >>>>> DAGs (and the tasks in them), but also how to *run* each task. Current >>>>> Python tasks use the Python Coordinator, running tasks by forking as >>>>> previously described. A new JVM Coordinator will instruct the task >>>>> >>>>> runner >>>>> >>>>> how to run tasks packaged in JAR files. >>>>> >>>>> Each coordinator implements a base interface (BaseRuntimeCoordinator) >>>>> >>>>> that >>>>> >>>>> handles three concerns: >>>>> >>>>> - Discovery: determining whether a given file belongs to this >>>>> >>>>> coordinator >>>>> >>>>> (e.g. JAR files for Java). >>>>> - DAG parsing: returning a runtime-specific subprocess command to parse >>>>> DAG files in the target language. >>>>> - Task execution: returning a runtime-specific subprocess command to >>>>> execute tasks in the target runtime. 
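Sketched in Python, those three concerns might look roughly like this. Only the name BaseRuntimeCoordinator comes from the proposal; the method names, signatures, and the JVM subclass below are illustrative guesses, not the actual interface:

```python
# Illustrative shape of the coordinator interface. Only the class name
# BaseRuntimeCoordinator is taken from the proposal; everything else
# here (method names, signatures, the JVM subclass) is hypothetical.
from abc import ABC, abstractmethod


class BaseRuntimeCoordinator(ABC):
    @abstractmethod
    def owns_file(self, path: str) -> bool:
        """Discovery: does this file belong to this coordinator?"""

    @abstractmethod
    def parse_command(self, path: str) -> list[str]:
        """DAG parsing: subprocess command that parses DAG files."""

    @abstractmethod
    def task_command(self, startup_details: dict) -> list[str]:
        """Task execution: subprocess command that runs a task."""


class JvmCoordinator(BaseRuntimeCoordinator):
    """Hypothetical JVM coordinator handling JAR bundles."""

    def owns_file(self, path: str) -> bool:
        return path.endswith(".jar")

    def parse_command(self, path: str) -> list[str]:
        return ["java", "-classpath", path, "org.example.DagParser"]

    def task_command(self, startup_details: dict) -> list[str]:
        return ["java", "-classpath", startup_details["jar"], startup_details["main_class"]]
```

A real coordinator would also hook into the bridge lifecycle the base class owns; this sketch only shows the three override points.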
>>>>> >>>>> The base class owns the full bridge lifecycle—TCP servers, subprocess >>>>> management, and cleanup—so language providers only need to implement >>>>> >>>>> these >>>>> >>>>> three methods. >>>>> >>>>> The coordinator translates a DagFileParseRequest (for DAG parsing) and >>>>> StartupDetails (for Task execution) data model (as declared in Airflow) >>>>> into the appropriate commands for the target runtime. For example, a >>>>> >>>>> “java >>>>> >>>>> -classpath ... /path/to/MainClass ...” subprocess command that points >>>>> >>>>> to >>>>> >>>>> the correct JAR file and main class in this case. >>>>> >>>>> Coordinators as Airflow Providers >>>>> ================================= >>>>> >>>>> The base coordinator interface and the Python coordinator will live in >>>>> “airflow.sdk.execution_time”. Other coordinators (for foreign >>>>> >>>>> languages) >>>>> >>>>> are registered through the existing Airflow provider mechanism. Each >>>>> >>>>> SDK >>>>> >>>>> provider declares its coordinator in its provider.yaml under a >>>>> “coordinators” extension point. Both ProvidersManager (airflow-core) >>>>> >>>>> and >>>>> >>>>> ProvidersManagerTaskRuntime (task-sdk) discover coordinators through >>>>> >>>>> this >>>>> >>>>> extension point. This means adding a new language runtime requires >>>>> >>>>> only a >>>>> >>>>> provider package. No changes to Airflow Core are needed. >>>>> >>>>> The new JVM-based coordinator will live in the namespace >>>>> “airflow.providers.sdk.java”. This is not the most accurate name >>>>> (technically it should be “jvm” instead), but in practice most users >>>>> >>>>> will >>>>> >>>>> recognize it, and (from my understanding) other JVM language users >>>>> >>>>> (e.g. >>>>> >>>>> Kotlin, Scala) are already well-versed enough dealing with Java >>>>> interoperability to understand “java” means JVM in this context. 
>>>>> >>>>> Writing DAGs in Java >>>>> ==================== >>>>> >>>>> This is not strictly connected to AIP-72, but we consider it a >>>>> natural next step since we can now implement tasks in a foreign >>>>> >>>>> language. >>>>> >>>>> Being able to define the DAG in the same language as the task >>>>> implementation is useful since writing Python, even if only with >>>>> >>>>> minimal >>>>> >>>>> syntax, is still a hurdle for those not already familiar with it, or even >>>>> allowed to run it. There are mainly three things we need on top of the >>>>> >>>>> task >>>>> >>>>> implementation interface: >>>>> >>>>> - DAG flags (e.g. schedule, max_active_tasks) >>>>> - Task flags (e.g. trigger_rule, weight_rule) >>>>> - Task dependencies >>>>> >>>>> A proof-of-concept implementation is included with other changes >>>>> >>>>> proposed >>>>> >>>>> elsewhere in this document. >>>>> >>>>> Lazy Consensus Topics >>>>> ===================== >>>>> >>>>> We’re calling for lazy consensus on the following topics: >>>>> >>>>> - A new “queue_to_sdk” configuration option to route tasks to a >>>>> >>>>> specific >>>>> >>>>> language SDK >>>>> - A new coordinator layer in the SDK to route implementations at >>>>> >>>>> execution >>>>> >>>>> time. >>>>> - New providers under airflow.providers.sdk to provide additional >>>>> >>>>> language >>>>> >>>>> support. >>>>> - Develop the Go SDK to support the proposed model and a provider >>>>> >>>>> package >>>>> >>>>> for the coordinator. (Existing features stay as-is; no breaking >>>>> >>>>> changes.) >>>>> >>>>> - Add the new Java SDK and the corresponding provider package.
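To make the first consensus item concrete, the routing it implies could reduce to a lookup along these lines. A sketch only: the fallback to the Python SDK for unmapped queues is my assumption, and the function name is hypothetical; only the option name and example mapping appear earlier in the thread:

```python
# Hypothetical sketch of the routing implied by the proposed
# [sdk] queue_to_sdk option. Only the option name and the example
# mapping come from the proposal; the rest is illustrative.
import json

# As it might be read from airflow.cfg:
#   [sdk]
#   queue_to_sdk = {"java-tasks": "java"}
queue_to_sdk = json.loads('{"java-tasks": "java"}')


def sdk_for_queue(queue: str) -> str:
    # Assume unmapped queues keep today's behavior: the Python SDK.
    return queue_to_sdk.get(queue, "python")
```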
>>>>> >>>>> TP >>>>> >>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail:[email protected] >>>>> For additional commands, e-mail:[email protected]
