> Pickle and the likes can execute arbitrary code that is inside the
> serialized object.
>
> Yep. This is super dangerous indeed.
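For context, the quoted concern is the standard pickle caveat: a crafted payload decides which callable runs at load time, typically via __reduce__. A minimal, harmless sketch of the mechanism (illustrative only):

    import pickle

    class Malicious:
        # pickle consults __reduce__ to learn how to rebuild the object;
        # returning (callable, args) means that callable runs inside loads().
        def __reduce__(self):
            return (print, ("arbitrary code ran during unpickling",))

    payload = pickle.dumps(Malicious())
    pickle.loads(payload)  # prints the message; a real payload could call os.system(...)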
My fifty cents. This sounds scarier than it actually is; it is mostly covered by these simple things:

- Do not open attachments from unknown senders / untrusted sources
- Do not open links from unknown senders / untrusted sources
- Do not fill in personal information / credit card details on crappy sites
- Do not install random software from the Internet
- Do not run strange Perl scripts in the console
- Do not run software as superuser
- Use repeatable installations, which prevent the situation where the owner of some package loses control of their PyPI account and someone uploads a new, modified version of the software
- Do not try to deserialize pickled objects from unknown sources

The main problem with dill / cloudpickle is that it has limitations on the Python version / library version; the same applies to pickle, but I think there it depends on the protocol version.

On Mon, 4 Dec 2023 at 22:15, Jarek Potiuk <[email protected]> wrote:

> Let me start separate threads from the main discussion. It will be easier to follow.
>
> I'd really like to understand the actors involved and what kind of threats the allowlist is supposed to protect us from.
>
> I think it would be great if we could see the "attack" scenario we think the allowlist will protect us against - explaining the actors involved and what actions they could take that the allowlist prevents.
>
> It might be that I do not see it, but as experience from our security team has shown, having a clear step-by-step explanation of what happens allows us to reason better about it and see if the protection is actually "functional". I think for now it's not doing much (except making it a bit harder for someone who would like to exploit it), but I might be mistaken (that has happened many times during our security discussions).
>
> > Pickle and the likes can execute arbitrary code that is inside the serialized object.
> >
> > Yep. This is super dangerous indeed.
> >
> > Airflow's serializers do not execute arbitrary code. Also your assumption of the 'serious' security problem is incorrect.
>
> I think that's a misunderstanding. I simply think that whatever we are trying to protect against is simply not effective, because it might be bypassed easily by DAG authors (if it is the DAG authors' actions we are trying to protect against). I am not sure if it is serious or not, because I am not sure what we are trying to protect against.
>
> But I believe whatever we do might simply not work as intended and even more - I think it's quite unnecessary.
>
> So I want to know the intentions, actors and scenarios :).
>
> I think clearly spelling out the actors involved and the scenario of the attack we are trying to protect against should help with clarifying it. I think it's been hinted at in the next few paragraphs, but I wanted to make sure that I understand it.
>
> > The deserializer will need to be able to load the class by using 'import_string' after verifying that it matches the allowed pattern (by default 'airflow.*'). If a DAG author does something like `airflow.providers.bla.bla` that class cannot be loaded, neither can the DAG author make that class available to a different DAG in a different context (which is where a real attack vector would be).
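To make the quoted mechanism concrete, here is a rough sketch of a pattern-based allowlist check in front of a dynamic import; the helper name and the fnmatch-based matching are illustrative assumptions, not the actual Airflow implementation:

    import fnmatch
    import importlib

    # Hypothetical allowlist mirroring the default pattern mentioned above.
    ALLOWED_PATTERNS = ["airflow.*"]

    def load_allowed_class(qualname: str):
        """Import a class only if its dotted path matches an allowed pattern."""
        if not any(fnmatch.fnmatch(qualname, pattern) for pattern in ALLOWED_PATTERNS):
            raise ImportError(f"{qualname} is not on the deserialization allowlist")
        module_path, _, class_name = qualname.rpartition(".")
        module = importlib.import_module(module_path)
        return getattr(module, class_name)

The interesting question in the discussion that follows is less the check itself and more who controls the dotted path and what code is importable under the allowed prefix.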
> So do I understand correctly that this one is aimed to protect against one DAG author (writing DAG A and storing data to XCom) - to not allow that XCom to be read in DAG B and execute code that the DAG A author wanted DAG B to execute during deserialization?
>
> In other words - we want to make sure that DAGs are somewhat isolated from each other?
>
> For now we are ignoring the fact that tasks for different DAGs are potentially executed on the same worker, in a fork of the same process, which has potentially other ways where one task influences what other tasks are doing. This is a different scenario / layer, and I do understand the concept of layered protection, but I wanted to make sure that it has been a deliberate consideration.
>
> Is this a fair assessment?
>
> Are there any other reasons for that allowlist?
>
> Finally - did we consider the potential attack scenario where imported sub-packages might come from multiple locations on PYTHONPATH (I am not sure it is possible to exploit in our case, but as of https://peps.python.org/pep-0420/ you can import sub-packages of a package from several locations)?
>
> > An attack vector is considered (as mentioned above) if you can change something in someone else's context. So, if I were able to serialize something and then influence / adjust its deserialization in a different context which should not be under my control.
>
> Yeah. I think here the "my control" and "others" and "different context" (i.e. the actors involved) were not clear to me (and still are not entirely).
>
> We have a security model (I just updated it to clarify some of the DAG author expectations) where currently we do not distinguish between different DAG authors - we assume that all DAG authors have access to the same execution environment, and generally speaking we cannot assume that DAG authors are isolated from each other. It's just impossible (currently) for multiple reasons.
>
> We are working on changing it with AIP-44 and one more not-yet-written AIP, and maybe this serialisation is a good place to add an additional layer of security, but I think it has the potential of giving our users a false sense of security if we do not properly describe it in the model. So we need to be very clear when we are describing the intended level of isolation provided.
>
> J.
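As an aside on the PEP 420 question above, a small self-contained sketch (with made-up directory and package names) of how sub-packages of one namespace package can come from two different sys.path entries:

    import os
    import sys
    import tempfile

    # Two separate sys.path roots, each contributing a sub-package to the same
    # PEP 420 namespace package "mypkg" (note: no top-level __init__.py anywhere).
    root_a = tempfile.mkdtemp()
    root_b = tempfile.mkdtemp()
    os.makedirs(os.path.join(root_a, "mypkg", "sub_a"))
    os.makedirs(os.path.join(root_b, "mypkg", "sub_b"))
    open(os.path.join(root_a, "mypkg", "sub_a", "__init__.py"), "w").close()
    open(os.path.join(root_b, "mypkg", "sub_b", "__init__.py"), "w").close()

    sys.path[:0] = [root_a, root_b]

    import mypkg.sub_a  # resolved under root_a
    import mypkg.sub_b  # resolved under root_b

    # Both import as "mypkg.*" even though they live in unrelated locations,
    # which is why a purely name-based allowlist does not pin code to one source.
    print(mypkg.sub_a.__file__)
    print(mypkg.sub_b.__file__)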
