Re: [DISCUSS] Allowlist in serializatin [was: Serialization in Apache Airflow]

Bolke de Bruin Tue, 05 Dec 2023 01:41:31 -0800

On Mon, 4 Dec 2023 at 19:15, Jarek Potiuk <[email protected]> wrote:

>
> > Airflow's serializers do not execute arbitrary code. Also your assumption
> > of the 'serious' security problem is incorrect.
>
>
> I think that's a misunderstanding. I simply think that whatever we are
> trying to protect against is simply not effective because it might be
> bypassed easily by DAG authors (if this is the DAG authors action we are
> trying to protect against). I am not sure if it is serious or not because I
> am not sure what we are trying to protect against.
>
> But I believe whatever we do might simply not work as intended and even
> more - I think it's quite unnecessary.
>
> So I want to know the intentions, actors and scenarios :).
>
> I think clearly spelling out actors involved and scenario of the attack we
> are trying to protect against should help with clarifying it. I think it's
> been hinted at in the next few paragraphs, but I wanted to make sure that I
> understand it.
>
>


> > The deserializer will need
> > to be able to load the class by using 'import_string' after verifying
> that
> > it matches the allowed pattern (by default 'airflow.*'). If a DAG author
> > does something like `airflow.providers.bla.bla` that class cannot be
> loaded
> > neither can the DAG author make that class available to a different DAG
> in
> > a different context (which is where a real attack vector would be).
> >
>
> So do I understand correctly that this one is aimed to protect against one
> DAG author (writing DAGA and storing data to XCom to not to allow it to get
> that XCom in DAG B and execute a code that  DAG A author wanted DAG B to
> execute during deserialization?
>
> In other words - we want to make sure that DAGS are somewhat  isolated from
> each other?
>
> For now we are ignoring the fact that tasks for different DAGs are
> executed potentially on the same worker, in a fork of the same process,
> which has potentially other ways where one task influences what other tasks
> are doing.


Yes, this really is a different layer where the OS provides certain
guarantees
and where Airflow has other protections (or maybe not).


> This is a different scenario / layer and I do understand
> the concept of layered protection but wanted to make sure that it's been a
> deliberate consideration.
>
> Is this a fair assesment?
>
>
Yes. The intention here is to protect the confidentiality of the DAG
(worker, content)
of User 1 (U1) against Malicious Actor 1 (MA1). These are different users
of the system,
both can author but they do not have direct access to each other's DAGs.

You can craft a dict that gets deserialized into an arbitrary object. This
dict can be pushed
into the DB (manually) as an XCom value. This object could run a
deserialize method that
does the same as the 'PickleBomb'.

The allow list protects against this vector as we have a "vetted" list of
objects that can be
deserialized into. So while the integrity of the DAG can be affected
(different value), you
shouldn't be able to affect the confidentiality.


> Are there any other reasons for that allowlist ?
>
>
I think the above is sufficient.


> Finally - did we consider the potential atta scenario where importing
> subpackagest might come from multiple locations on PYTHONPATH (I am not
> sure it is possible to exploit in our case but as of
> https://peps.python.org/pep-0420/ you can  import sub-packages of packages
> from several locations) ?
>

The assumption here is that the OPS environment is immutable / in a known
state.
Thus the DAG of U1 runs in a separate process from the DAG of MA1. That
means
( = it is assumed) that MA1 cannot change the PYTHONPATH of U1.


> An attack vector is considered (as mentioned above) if you can change
> > something in someone else's context. So, if I would be able to serialize
> > something and then influence / adjust its deserialization in a different
> > context which should not be under my control.
> >
>
> Yeah. I think here the "my control" and "others" and "different context"
> (i.e. actors involved) were not clear to me (still not entirely).
>
>  We have a security model (just updated it to clarify some of the DAG
> author expectations) where currently we do not distinguish between
> different DAG authors - we assume that all DAG Authors have access to the
> same execution environment and generally speaking we cannot assume that
> other DAG authors are isolated from each other. It's just impossible
> (currently) for multiple reasons.
>
> We are working on changing it with AIP-44 and one more not-yet-written-aip,
> and maybe it's good that this serialisation is good to  add extra
> additional layer of security, but I think it has the potential of giving
> our users false sense of security - if we do not properly describe it in
> the model. So we need to be very clear when we are describing the intended
> level of isolation provided.
>
>
We are not just talking about the different DAG authors here. The DAG of U1
might run in an environment different from the DAG of MA1. If MA1 wants to
get
extra information on that environment by doing something in the context of
U1 it
affects OPS which is, in good practices, not (managed by) the same person
as U1.

Bolke
-- 

--
Bolke de Bruin
[email protected]

Re: [DISCUSS] Allowlist in serializatin [was: Serialization in Apache Airflow]

Reply via email to