> Pickle and the likes can execute arbitrary code that is inside the
> serialized object.
>
> Yep. This is super dangerous indeed.
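For context, the quoted concern is the standard pickle caveat: a crafted payload decides which callable runs at load time, typically via __reduce__. A minimal, harmless sketch of the mechanism (illustrative only):

    import pickle

    class Malicious:
        # pickle consults __reduce__ to learn how to rebuild the object;
        # returning (callable, args) means that callable runs inside loads().
        def __reduce__(self):
            return (print, ("arbitrary code ran during unpickling",))

    payload = pickle.dumps(Malicious())
    pickle.loads(payload)  # prints the message; a real payload could call os.system(...)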
My fifty cents. This sounds scarier than it actually is; it is mostly covered by these simple things:

- Do not open attachments from unknown senders / untrusted sources
- Do not open links from unknown senders / untrusted sources
- Do not fill in personal information / credit card details on crappy sites
- Do not install random software from the Internet
- Do not run strange Perl scripts in the console
- Do not run software as superuser
- Use repeatable installations, which prevent the situation where the owner of some package loses control of their PyPI account and someone uploads a new, modified version of the software
- Do not try to deserialize pickled objects from unknown sources

The main problem with dill / cloudpickle is that it has limitations on the Python version / library version; the same applies to pickle, but I think there it depends on the protocol version.

On Mon, 4 Dec 2023 at 22:15, Jarek Potiuk <[email protected]> wrote:

> Let me start separate threads from the main discussion. It will be easier to follow.
>
> I'd really like to understand the actors involved and what kind of threats the allowlist is supposed to protect us from.
>
> I think it would be great if we could see the "attack" scenario we think the allowlist will protect us against - explaining the actors involved and what actions they could take that the allowlist prevents.
>
> It might be that I do not see it, but as experience from our security team has shown, having a clear step-by-step explanation of what happens allows us to reason better about it and see if the protection is actually "functional". I think for now it's not doing much (except making it a bit harder for someone who would like to exploit it), but I might be mistaken (that has happened many times during our security discussions).
>
> > Pickle and the likes can execute arbitrary code that is inside the serialized object.
> >
> > Yep. This is super dangerous indeed.
> >
> > Airflow's serializers do not execute arbitrary code. Also your assumption of the 'serious' security problem is incorrect.
>
> I think that's a misunderstanding. I simply think that whatever we are trying to protect against is simply not effective, because it might be bypassed easily by DAG authors (if it is the DAG authors' actions we are trying to protect against). I am not sure if it is serious or not, because I am not sure what we are trying to protect against.
>
> But I believe whatever we do might simply not work as intended and even more - I think it's quite unnecessary.
>
> So I want to know the intentions, actors and scenarios :).
>
> I think clearly spelling out the actors involved and the scenario of the attack we are trying to protect against should help with clarifying it. I think it's been hinted at in the next few paragraphs, but I wanted to make sure that I understand it.
>
> > The deserializer will need to be able to load the class by using 'import_string' after verifying that it matches the allowed pattern (by default 'airflow.*'). If a DAG author does something like `airflow.providers.bla.bla` that class cannot be loaded, neither can the DAG author make that class available to a different DAG in a different context (which is where a real attack vector would be).
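To make the quoted mechanism concrete, here is a rough sketch of a pattern-based allowlist check in front of a dynamic import; the helper name and the fnmatch-based matching are illustrative assumptions, not the actual Airflow implementation:

    import fnmatch
    import importlib

    # Hypothetical allowlist mirroring the default pattern mentioned above.
    ALLOWED_PATTERNS = ["airflow.*"]

    def load_allowed_class(qualname: str):
        """Import a class only if its dotted path matches an allowed pattern."""
        if not any(fnmatch.fnmatch(qualname, pattern) for pattern in ALLOWED_PATTERNS):
            raise ImportError(f"{qualname} is not on the deserialization allowlist")
        module_path, _, class_name = qualname.rpartition(".")
        module = importlib.import_module(module_path)
        return getattr(module, class_name)

The interesting question in the discussion that follows is less the check itself and more who controls the dotted path and what code is importable under the allowed prefix.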
> So do I understand correctly that this one is aimed to protect against one DAG author (writing DAG A and storing data to XCom) - to not allow that XCom to be read in DAG B and execute code that the DAG A author wanted DAG B to execute during deserialization?
>
> In other words - we want to make sure that DAGs are somewhat isolated from each other?
>
> For now we are ignoring the fact that tasks for different DAGs are potentially executed on the same worker, in a fork of the same process, which has potentially other ways where one task influences what other tasks are doing. This is a different scenario / layer, and I do understand the concept of layered protection, but I wanted to make sure that it has been a deliberate consideration.
>
> Is this a fair assessment?
>
> Are there any other reasons for that allowlist?
>
> Finally - did we consider the potential attack scenario where imported sub-packages might come from multiple locations on PYTHONPATH (I am not sure it is possible to exploit in our case, but as of https://peps.python.org/pep-0420/ you can import sub-packages of a package from several locations)?
>
> > An attack vector is considered (as mentioned above) if you can change something in someone else's context. So, if I were able to serialize something and then influence / adjust its deserialization in a different context which should not be under my control.
>
> Yeah. I think here the "my control" and "others" and "different context" (i.e. the actors involved) were not clear to me (and still are not entirely).
>
> We have a security model (I just updated it to clarify some of the DAG author expectations) where currently we do not distinguish between different DAG authors - we assume that all DAG authors have access to the same execution environment, and generally speaking we cannot assume that DAG authors are isolated from each other. It's just impossible (currently) for multiple reasons.
>
> We are working on changing it with AIP-44 and one more not-yet-written AIP, and maybe this serialisation is a good place to add an additional layer of security, but I think it has the potential of giving our users a false sense of security if we do not properly describe it in the model. So we need to be very clear when we are describing the intended level of isolation provided.
>
> J.
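As an aside on the PEP 420 question above, a small self-contained sketch (with made-up directory and package names) of how sub-packages of one namespace package can come from two different sys.path entries:

    import os
    import sys
    import tempfile

    # Two separate sys.path roots, each contributing a sub-package to the same
    # PEP 420 namespace package "mypkg" (note: no top-level __init__.py anywhere).
    root_a = tempfile.mkdtemp()
    root_b = tempfile.mkdtemp()
    os.makedirs(os.path.join(root_a, "mypkg", "sub_a"))
    os.makedirs(os.path.join(root_b, "mypkg", "sub_b"))
    open(os.path.join(root_a, "mypkg", "sub_a", "__init__.py"), "w").close()
    open(os.path.join(root_b, "mypkg", "sub_b", "__init__.py"), "w").close()

    sys.path[:0] = [root_a, root_b]

    import mypkg.sub_a  # resolved under root_a
    import mypkg.sub_b  # resolved under root_b

    # Both import as "mypkg.*" even though they live in unrelated locations,
    # which is why a purely name-based allowlist does not pin code to one source.
    print(mypkg.sub_a.__file__)
    print(mypkg.sub_b.__file__)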
