bolkedebruin commented on issue #40974:
URL: https://github.com/apache/airflow/issues/40974#issuecomment-2252426086

   Ha! There are some things to consider when wanting to do "consolidation". 
The leading principle in the past was that we do not want to execute code when 
deserializing, which is exactly what *all* the third-party (de)serializers 
(pickle, dill, cloudpickle, etc.) seem to do. The second principle was to have 
a human-readable format, and the third principle was to have it versioned. 
   
   I've added "serde" (the no. 2 that @potiuk mentioned earlier) to have a 
generic way of serializing any object with those principles in mind. This is 
particularly useful for XCom, which shares arbitrary data across workers. The 
'other' serializer, which I would call the DAG serializer, has three main 
shortcomings: 1) it is slow (serde takes about 10% of the time the DAG 
serializer takes), 2) it is hard to extend (you would need to change the core 
code to add an extension), and 3) doing so adds O(n) time. The upside is that 
it is tried and tested, serializes DAGs, and does JSON schema validation.
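   For readers unfamiliar with the approach, the three principles can be 
sketched in a few lines. This is a hypothetical envelope, not serde's actual 
wire format (the `Sensor` class, key names, and registry are all illustrative): 
the output is plain JSON (human readable), carries a version field, and names 
the class instead of embedding code, so deserialization only rebuilds types 
from an explicit allow-list.

```python
import json
from dataclasses import dataclass, asdict

# A toy payload; any simple dataclass works for the illustration.
@dataclass
class Sensor:
    name: str
    threshold: float

def serialize(obj, version=1):
    # Human-readable, versioned, and declarative: the envelope names the
    # class rather than embedding reconstruction code.
    return json.dumps({
        "classname": f"{type(obj).__module__}.{type(obj).__qualname__}",
        "version": version,
        "data": asdict(obj),
    })

# Deserialization looks the class up in an explicit allow-list, so no
# code carried in the payload itself ever runs.
REGISTRY = {f"{Sensor.__module__}.{Sensor.__qualname__}": Sensor}

def deserialize(blob):
    envelope = json.loads(blob)
    cls = REGISTRY[envelope["classname"]]  # KeyError for unknown types
    return cls(**envelope["data"])

blob = serialize(Sensor(name="temp", threshold=21.5))
restored = deserialize(blob)
print(restored)
```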
   
   It might then seem the obvious route to add DAG serialization to "serde". 
I did try that, but it also felt a bit like squeezing something into a place 
where it doesn't entirely fit (keeping backwards compatibility in mind). It is 
possible, but a lot of past cruft would need to be re-implemented to make it 
work. 
   
   Now I see other projects like Spark Connect settle on cloudpickle and 
forgo the issue of arbitrary code execution. The question then becomes: how 
relevant is that attack vector? Is the trade-off of maintaining our own 
serializer worth it? Also, cloudpickle will not generate a human-readable 
format. Do we need to review our principles (which I think have never been 
officially settled, but correct me if I am wrong)?
   
   Concluding: adding DAG serialization to "serde" is probably the most 
Airflow way to go. It gives you extensibility for the future, as serde has the 
better format (over the DAG serializer) and, with a little bit of help, can 
serialize any kind of object. It seems to serve us well nowadays. If you take 
a step back and want to re-evaluate, it might be worth revisiting our 
principles, checking what we can do to reduce the attack vector, and maybe 
going for cloudpickle. That would externalize the support and reduce the 
maintenance burden. However, we might run into issues when something cannot be 
serialized through cloudpickle, and we do not control how it works.
   
   My 2 cents.
   
   
   

