bolkedebruin commented on issue #40974: URL: https://github.com/apache/airflow/issues/40974#issuecomment-2252426086
Ha! There are a few things to consider when wanting to do "consolidation". The main leading principle in the past was that we do not want executable code to run when deserializing. Running code on load is what *all* third-party (de)serializers seem to do: pickle, dill, cloudpickle, etc. The second principle was to have a human-readable format, and the third was to have it versioned.

I added "serde" (the no. 2 that @potiuk is speaking of) in the past to have a generic way of serializing any object with these principles in mind. This is particularly useful for XCom, as that shares arbitrary data across workers.

The 'other' serializer, which I would call the DAG serializer, has three main shortcomings: 1) it is slow (serde takes about 10% of the time the DAG serializer takes), 2) it is hard to extend, as you would need to change the core code to add an extension, and 3) it will add O(n) in time to do so. The upside is that it is tried and tested, serializes DAGs, and does JSON schema validation.

It might then seem the obvious route to add DAG serialization to "serde". I did try that, but it felt a bit like squeezing something into something else where it doesn't entirely fit (keeping backwards compatibility in mind). It is possible, but a lot of past cruft would need to be re-implemented to make it work.

Now I see other projects like Spark Connect settle on cloudpickle, and they forgo the issue of arbitrary code execution. The question then becomes: how relevant is that attack vector? Is the tradeoff of maintaining our own serializer worth it? Also, cloudpickle will not generate a human-readable format. Do we need to review our principles (which I think have never been officially settled, but correct me if I am wrong)?

Concluding: if you add DAG serialization to "serde", it is probably the most Airflow way to go. It gives you extensibility for the future, as it has the better format (over the DAG serializer) and can, with a little bit of help, serialize any kind of object.
It seems to serve us well nowadays.

If you take a step back and want to re-evaluate, it might be worth revisiting our principles and checking what we can do to reduce the attack vector, and maybe go for cloudpickle. This would externalize the support and reduce the maintenance burden. However, we might run into issues when something cannot be serialized through cloudpickle, since we do not control how it works.

My 2 cents.
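
To make the trade-off concrete, here is a minimal sketch contrasting the two approaches. It is not Airflow's serde API: `Point`, `ALLOWED`, `encode`, `decode`, the envelope keys (`type`, `version`, `data`) and the `SERDE_DEMO_FLAG` environment variable are all made up for illustration. Part 1 shows why loading a pickle can run code chosen by whoever produced the bytes; part 2 shows a data-only, versioned, human-readable envelope that only rebuilds allow-listed classes.

```python
import json
import os
import pickle
from dataclasses import asdict, dataclass

# --- Part 1: pickle can run arbitrary code while loading -------------------

class Malicious:
    def __reduce__(self):
        # pickle stores "call exec(<string>)" instead of object state.
        # The string here is harmless; a real attacker could put anything in it.
        code = "__import__('os').environ['SERDE_DEMO_FLAG'] = 'set-by-pickle-load'"
        return (exec, (code,))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # executes the string above as a side effect of loading

# --- Part 2: data-only, versioned, human-readable envelope -----------------

@dataclass
class Point:  # hypothetical example type
    x: int
    y: int

# Allow-list (hypothetical): name -> (class, highest version we understand).
ALLOWED = {"Point": (Point, 1)}

def encode(obj) -> str:
    """Serialize an allow-listed dataclass instance to a JSON envelope."""
    name = type(obj).__name__
    if name not in ALLOWED:
        raise TypeError(f"{name!r} is not registered for serialization")
    _, version = ALLOWED[name]
    envelope = {"type": name, "version": version, "data": asdict(obj)}
    return json.dumps(envelope, sort_keys=True)

def decode(text: str):
    """Rebuild an object from plain data; nothing from the payload is executed."""
    envelope = json.loads(text)
    name = envelope["type"]
    if name not in ALLOWED:
        raise TypeError(f"{name!r} is not allowed to be deserialized")
    cls, max_version = ALLOWED[name]
    if envelope["version"] > max_version:
        raise TypeError(f"{name!r} version {envelope['version']} is too new")
    return cls(**envelope["data"])

text = encode(Point(1, 2))
restored = decode(text)
```

After running this, `os.environ["SERDE_DEMO_FLAG"]` has been set purely by calling `pickle.loads`, while `decode` refuses any payload whose `type` is not in `ALLOWED`. The envelope is plain JSON, so it can be read and diffed by a human, and the `version` field leaves room to change a type's layout later. The cost is the one described above: every supported type has to be registered and maintained by hand.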
