potiuk commented on PR #27468: URL: https://github.com/apache/airflow/pull/27468#issuecomment-1307348759
Yep. AIP is even approved - conditionally re- on-the-wire transport mechanism (i.e remote method execution choice) and serialization choice we have. And while I iniitially advocated for GRPC (which answer both questions), there are some drawbacks of using GRPC (debuggability and threading model used) - this has been pointed by @ashb and @andrewgodwin mostly. There are some benefitst of using more "web standard" approach - JSON-RPC allows us to plug-in better into already existing mechanisms we use and deos not require from us to learn and operate any new technology. The tests performed (initally by me then later by Mateusz) have shown that the performance overhead from using binary RPC vs. JSON are not really meaningful in our context, so they are not decisive factor. I think the last thing that we need to see if whether we can replace Pickling with our own JSON serialization we alredy use in airflow - it seems like bad idea initially (because of the overhead) but again, it's likely negigible and the benefit of using more "standard" for us approach (we serialize internal objects using JSON Serializer a lot in various cases) and being able to debug it "on-the-wire" without extra tools are I think important benefits that should not be under-rated. I think here our own JSON Serializer trumps Pickle, because it already transparently handles a number of structures we will have to serialize including some K8S configuraitons (which require custom serialization when using pickle. The "easiness of development" and maintainability were for me important but as @mhenc mentioned - we did both POCs - GRPC here (I did it): https://github.com/apache/airflow/pull/25094 and this one is using JSON-RPC and for me, it looks like the approach proposed by Mateusz is even easier to develop and maintain - and as long as our serialization approch works transparently for all the object that we need to send/retrieve. So i got convinced that what Ash and Andrew proposed is a good idea. It does not change the principle of the AIP - it's just a matter of implementation detail. From the development perspective, this will be mostly decorating of the methods that we want to turn into "remote" ones. @bolkedebruin - I think if you read through the discussions - maybe you will have other specifc comments about those choices - if they have anything to add on top of the above POCs. Tests and conclusions we have based on those. We deliberately took it slower with Mateusz and gave both of us (and the community) time to comment and raise their concerns, the AIP general approach is already approved, but if there are any specific concerns on those technology choices - we are still open to hear a bit more comments before we make the first foundng "real PR" rather than POC - as we promised when we called for voting on the AIP. Do you have some specific concerns or other criteria that we should consider. Just to summarize those that we took into account: * easines of development/maintenance * debuggability * familiarlity of the community with underlying technologies * speed/performance of serialisation and transport They are now listed in order of preference - mostly because we found out that speed/performance of serialization and transport is literally orders of magnitude less important than speed/performance of the operations we are going to serialize and there is very little difference of the overhead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
