Also, Giorgio - we are not ignoring FlatBuffers, but one of our criteria when we discussed "gRPC vs. JSON" was that we did not want to introduce new technologies that might require extra learning or skills when debugging problems.
My thought process here: we already use our own custom JSONSerialization in a few places, and it can already handle some "non-standard" structures for us (for example K8S objects). If we go with FlatBuffers, we basically repeat the same story - any custom objects that are not directly serializable need to be handled in a custom way.

Yes, our serialization is not "standard" outside Airflow, and a "reusable" serialization library would likely be better optimized. We looked at Pydantic, which is very popular in the Python API world. Its v2 has all the internals rewritten in Rust, for example, and it is sometimes 16 times faster. But in a way that is premature optimisation, IMHO. We are not aiming for cross-language support (FlatBuffers and gRPC are). We are not aiming to send extremely complex or huge structures. We already know that serialization will not be our bottleneck, and serialization speed has very limited impact.

So getting a "faster" or "more standard" solution is not as important as getting the "more familiar" one. I think, if anything, that alone means our custom serialization wins.

And this is an easily replaceable choice. Should we ever feel the need to make it cross-language, it's just one component to replace :)

J.

On Tue, Nov 8, 2022 at 4:53 PM Jarek Potiuk <[email protected]> wrote:
>
> Very good point. I have not thought about it but this is a very strong
> reason to use our JSONSerialization.
>
> J,
>
> On Tue, Nov 8, 2022 at 4:28 PM Mateusz Henc <[email protected]> wrote:
> >
> > I just learned from https://docs.python.org/3/library/pickle.html
> > Warning The pickle module is not secure. Only unpickle data you trust.
> > It is possible to construct malicious pickle data which will execute
> > arbitrary code during unpickling. Never unpickle data that could have come
> > from an untrusted source, or that could have been tampered with.
> >
> > So there we have a "trusted" component - Internal API, exposing an endpoint
> > that can be called from Worker, so from any arbitrary code. Unless there
> > are some ways to protect from this, it seems that JSON serialization is the
> > only option there.
> >
> > Best regards,
> > Mateusz Henc
> >
> >
> > On Thu, Nov 3, 2022 at 9:45 AM Mateusz Henc <[email protected]> wrote:
> >>
> >> Thank you Giorgio.
> >>
> >> TBH I've never heard about FlatBuffer, but I will take a look.
> >> The big advantage I see for Pickle is seamless integration - no additional
> >> conversion code required for developers, especially that we get the
> >> arguments as a dictionary. Pickle handles it without any problem (at least
> >> in my tests). If FlatBuffer offers a similar experience then we definitely
> >> should consider it.
> >> The other question is if we should introduce yet another dependency to
> >> Airflow - the number of pypi packages is big anyway, which leads to many
> >> problems when users want to install their customer packages (dependency
> >> conflicts etc).
> >>
> >> Best regards,
> >> Mateusz Henc
> >>
> >>
> >> On Wed, Nov 2, 2022 at 6:59 PM Giorgio Zoppi <[email protected]>
> >> wrote:
> >>>
> >>> Hello,
> >>> this is something i'd like to work too in my spare time but some i'd
> >>> rather use flatbuffers for the payload since its duality json/binary.
> >>> Flatbuffers have the nice feature that they're able to parse JSON files
> >>> that conform to a schema into FlatBuffer binary files, so you can have
> >>> duality json -> binary_on_wire-> dataclasses at receiver side. I'd rather
> >>> put pickle as last resort or avoid at all. @Mateus please feel free to
> >>> sync with me privately on missing actions items to make this feature a
> >>> success.
> >>> Just 1c,
> >>> Best Regads,
> >>> Giorgio
> >>>
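
For illustration only, here is a minimal sketch of what "handling custom objects in a custom way" on top of plain JSON can look like, and why it sidesteps the pickle warning quoted above. This is not Airflow's actual serialization code: `PodSpec`, the `__type`/`__data` tag names, and the `dumps`/`loads` helpers are all hypothetical. The point is that the receiving side only rebuilds types from an explicit allow-list, so a payload cannot make it execute arbitrary code the way unpickling can.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical tag names, chosen only for this sketch.
TYPE_KEY = "__type"
DATA_KEY = "__data"


@dataclass
class PodSpec:
    """Hypothetical stand-in for a "non-standard" object such as a K8S object."""
    name: str
    image: str


# Allow-list of types the receiver is willing to rebuild.
# Anything not listed here is rejected instead of being instantiated.
DECODERS = {
    "datetime": datetime.fromisoformat,
    "PodSpec": lambda data: PodSpec(**data),
}


def _encode(obj):
    """`default=` hook: called by json.dumps for objects it cannot serialize."""
    if isinstance(obj, datetime):
        return {TYPE_KEY: "datetime", DATA_KEY: obj.isoformat()}
    if isinstance(obj, PodSpec):
        return {TYPE_KEY: "PodSpec", DATA_KEY: asdict(obj)}
    raise TypeError(f"Cannot serialize object of type {type(obj).__name__}")


def _decode(dct):
    """`object_hook=` hook: called by json.loads for every decoded JSON object."""
    if TYPE_KEY in dct:
        tag = dct[TYPE_KEY]
        if tag not in DECODERS:
            raise ValueError(f"Refusing to deserialize unknown type tag: {tag!r}")
        return DECODERS[tag](dct[DATA_KEY])
    return dct


def dumps(value):
    return json.dumps(value, default=_encode)


def loads(text):
    return json.loads(text, object_hook=_decode)


# Arguments arrive as a dictionary, as mentioned earlier in the thread.
call_args = {
    "when": datetime(2022, 11, 8, 16, 53, tzinfo=timezone.utc),
    "pod": PodSpec(name="worker", image="example-image"),
}
restored = loads(dumps(call_args))
```

The wire format stays human-readable JSON, which keeps debugging familiar, and swapping it for something cross-language later would only mean replacing the `dumps`/`loads` pair.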
