On Mon, 4 Dec 2023 at 02:10, Jarek Potiuk <[email protected]> wrote:
> >
> >
> > I am reaching out to initiate a discussion on a topic that has been
> causing
> > some confusion within the community: serialization in Apache Airflow. The
> > purpose of this email is to shed light on the current state of
> > serialization and, if necessary, open the floor for discussions on the
> way
> > forward.
> >
>
> Cool!
>
> > 2. *XCom Serializer (airflow.serialization.serde) – Since 2.5:*
> >
>
> It's rather cool - I agree, and it has noble goals.
>
>
> > Other serializers, like pickle and dill, are selectively employed in
> > certain areas, probably given the lack of serialization for a particular
> > object at the time. While both serializers support encoding into JSON,
> > the misconception is that we directly serialize into JSON. We serialize
> > into a dict with primitives, which is then encoded into JSON if needed,
> > or something else if we would like to.
> >
>
> Yep. Serializing is surprisingly difficult and has a lot of traps.
>
> > So, let's rewind a bit and talk about how the XCom serializer came into
> > the Apache Airflow scene. It wasn't just a random addition; the intention
> > was to have the superhero version of serialization. Unlike its sibling,
> > the DAG serializer, XCom was born to handle tricky stuff that the
> > JSONEncoder struggled with. It wasn't just about being fast and slick;
> > the XCom serializer wanted to be the cool kid on the block, offering
> > better versioning and an easier way to add new features. The goal was to
> > get out of the way of users as much as possible, so you could just use
> > TaskFlow and not think about how to share results of tasks with the next
> > task. For example, you can just share DataFrames and it handles them
> > like a champ.
> >
>
> Yep. I think the serializer as it is used now - especially in TaskFlow,
> where we want to make sure that the producer's output can be deserialized
> by the consumer into exactly the same object - is great. I really like
> the transparency here, and I do remember the problems we had to solve
> before it came to being.
>
>
> > *Security*
> >
> > In addressing security concerns, there are inherent risks associated
> > with serialization libraries such as pickle, dill, and cloudpickle due
> > to their potential for executing arbitrary code during deserialization.
> > This is the main raison d'être for both the DAG serializer and the XCom
> > serializer. To mitigate the risks stemming from deserialization of
> > arbitrary objects, the XCom serializer employs an allow list. This list
> > specifies the permitted classes during deserialization, minimizing the
> > threat of potential malicious exploits. The DAG serializer has a fixed
> > scheme of inputs and thus is limited during deserialization to those
> > inputs.
> >
>
> Actually this is one of the things I do not understand (and I even had
> a discussion on this very subject with Elad recently). I think we have
> a serious security problem with it, if it is supposed to protect us
> from anything. Currently we have an `airflow.*` exclusion, and there is
> quite a terrible hole in it (if I understand what the protection is
> about). There is nothing preventing the user from declaring their own
> class as `airflow.providers.my_provider_package.my_class` in their DAG.
> There is no mechanism preventing it. I think we need to have a serious
> discussion about the actual security of this mechanism, because I think
> it gives a false sense of security, and introduces unnecessary friction
> without giving any security benefits.
> IMHO the only "real" mechanism to prevent something is the entrypoint
> mechanism we use for plugins and providers, because (unlike a package
> name) it requires installing the package in the environment by the
> Deployment Manager, while the package name limitation can be easily
> bypassed by simply having code on the PYTHONPATH (which our DAGs
> already do).
> But maybe I do not understand what it should protect us against. I think
> that one needs a separate discussion.
>
Pickle and the like can execute arbitrary code that is embedded in the
serialized object. For example:

    import os
    import pickle

    class PickleBomb:
        def __reduce__(self):
            # __reduce__ tells pickle how to reconstruct the object:
            # a callable plus its arguments. Pickle will happily call
            # os.system on load.
            cmd = 'touch ./badfile.txt'
            return os.system, (cmd,)

    pickled = pickle.dumps(PickleBomb())
    print(pickled)
    # b'\x80\x03cposix\nsystem\nq\x00X\x13\x00\x00\x00touch ./badfile.txtq\x01\x85q\x02Rq\x03.'
    pickle.loads(pickled)  # boom
Airflow's serializers do not execute arbitrary code. Also, your assumption
of a 'serious' security problem is incorrect. The deserializer needs to
load the class using 'import_string', after verifying that the classname
matches the allowed pattern (by default 'airflow.*'). If a DAG author
declares something like `airflow.providers.bla.bla`, that class cannot be
loaded, nor can the DAG author make that class available to a different
DAG in a different context (which is where a real attack vector would be).
So the risk mitigation looks like this:
1. Only read primitives from the serialized version - do not execute
2. Verify the allow list
3. Check whether the class can be loaded by import_string (NOT what is in
memory)
An attack vector exists (as mentioned above) if you can change something
in someone else's context: if I were able to serialize something and then
influence or adjust its deserialization in a different context that should
not be under my control.
What you do in your own DAG we do not care about - you can do everything
there anyway. It is about protecting the DAG author from someone else
messing with the data.
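
To make that concrete, a minimal sketch of the pattern (simplified; the
actual implementation lives in airflow.serialization.serde, and the
patterns come from configuration):

    import fnmatch
    import re
    from importlib import import_module

    # Hypothetical allow list; in Airflow this is configurable and
    # defaults to allowing airflow-internal classes only.
    ALLOWED = [re.compile(fnmatch.translate(p)) for p in ("airflow.*",)]

    def import_string(dotted_path: str):
        """Load a class through the real import machinery."""
        module_path, _, class_name = dotted_path.rpartition(".")
        return getattr(import_module(module_path), class_name)

    def deserialize(classname: str, data: dict):
        # 1. Verify the allow list before touching anything else.
        if not any(p.match(classname) for p in ALLOWED):
            raise ImportError(f"{classname} is not allowed")
        # 2. Resolve the class via import_string - what is importable
        #    in the environment, NOT whatever happens to be in memory.
        cls = import_string(classname)
        # 3. Rebuild from primitives only; nothing in `data` is executed.
        return cls.deserialize(data)

(The `cls.deserialize(data)` call at the end is illustrative - the real
contract also carries a version.)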
> > *Encryption and Decryption*
> >
> > When displaying information in the UI, and also during retrieval from
> > the database, it is important to secure sensitive information as we do
> > for Connections. Efforts are underway to introduce encryption and
> > decryption methods in the XCom serializer. The motivation behind this
> > initiative stems from the need to protect potentially confidential
> > data, such as credentials for cloud access, which may be required by
> > underlying integrations. The proposed pattern draws inspiration from
> > established practices seen in the Connection module. Despite the
> > acknowledgment that certain fields may be sensitive, the community has
> > traditionally left the encryption of such fields to the serializer at
> > hand. The introduction of encryption and decryption methods in the
> > XCom serializer seeks to address this gap, providing a standardized
> > approach to enhance the overall security posture, with the pending
> > integration awaiting community discussion (this thread) and potential
> > follow-up modifications.
> >
>
> I think this one requires a separate discussion - especially on the
> expected security properties of it. I am not at all sure whether we
> really want to encrypt individual fields (down to primitives like int)
> in the serialized data. In order to do that, we would have to find ways
> to define which of the fields of the arbitrary objects are sensitive.
> This is not coming out-of-the-box with all the various ways you can make
> an object serializable (@dataclass? serialize/deserialize method?).
>
It is a choice. Selective encryption is handy for debugging - so that parts
of the value remain visible in the UI. This is how we handle Connections
currently. The cons you raise are valid, though.
> It has pretty unknown security properties (what happens when you encrypt
> an int with a FERNET key? - isn't it easily recoverable then?). While for
> a connection we could define a set of heuristics (names of fields and
> extras that we deem sensitive), generalizing them to "arbitrary" objects
> seems rather difficult.
>
>
The security properties are actually quite well known. Fernet encrypts
'bytes'; it does not handle anything else. So for an 'int', as for a
'str', you need to convert the value to bytes before encryption.
I have included an encryption protocol that deals with the primitives in
my PR. It is pretty easy actually. It is a protocol because Fernet
requires bytes and thus loses type information; the protocol carries that
type information along.
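
As an illustration (not the exact code from the PR - the type tag scheme
here is made up), such a protocol can be as simple as:

    from cryptography.fernet import Fernet

    fernet = Fernet(Fernet.generate_key())

    def encrypt_primitive(value) -> bytes:
        # Fernet only accepts bytes, so prefix a type tag before
        # encoding; the tag lets us restore the primitive on decrypt.
        tag = type(value).__name__  # e.g. "int", "str", "bool"
        return fernet.encrypt(f"{tag}:{value}".encode())

    def decrypt_primitive(token: bytes):
        tag, _, raw = fernet.decrypt(token).decode().partition(":")
        return {"int": int, "float": float, "str": str,
                "bool": lambda s: s == "True"}[tag](raw)

    token = encrypt_primitive(42)
    assert decrypt_primitive(token) == 42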
> Also I think we should understand why we want to encrypt the data, and
> why individual fields. Are we going to communicate to our users that
> some of the data sent there is sensitive? I understand that the need
> comes from the fact that SOME sources of the XCom data already contain
> sensitive information (say, passwords and URLs). But would it not be
> better to simply encrypt the whole blob we are sending as XCom, rather
> than individual fields? It might of course impact performance, so maybe
> not the best idea.
Modern processors have hardware acceleration for encryption. The
performance hit is very likely negligible in our case.
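If anyone wants numbers, a quick micro-benchmark is easy enough (results
will obviously vary per machine; the payload size is an arbitrary pick):

    import time
    from cryptography.fernet import Fernet

    f = Fernet(Fernet.generate_key())
    payload = b"x" * 10_000  # a 10 kB, XCom-sized blob

    start = time.perf_counter()
    for _ in range(1_000):
        f.decrypt(f.encrypt(payload))
    elapsed = time.perf_counter() - start
    print(f"{1_000 * len(payload) / elapsed / 1e6:.1f} MB/s round trip")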
> Another option is to `expect` the serializer to encrypt sensitive data
> on its own, and give it an API so that it can do so manually while
> serializing the field (or maybe that is what really is proposed and I
> misunderstand it). So I think we need a bit more of a "design decision"
> here on how we are going to approach it.
>
>
>
We could. The important thing to note is that data leakage can occur
through XCom and Lineage. Do you want your random row to end up in
lineage, readable by someone outside your span of control? Do you want to
expose your XCom value to someone who has access to the UI?
>
> > *Coupling*
> >
> > Both primary serializers function transparently to the user and
> > developer, except when explicit serialization or possibly encryption
> > is required. This means that a provider or a task can just return an
> > arbitrary object and that is going to be handled by core Airflow
> > without any adjustments. Except, of course, when it doesn't. The
> > typical stack trace for this is an error that says "cannot serialize
> > an object of type XX".
>
>
> >
> > This error comes from the XCom serializer and can thus be solved by
> > providing the serialize/deserialize method, providing a custom
> > serializer, OR converting the object into something the serializer
> > does understand. However, this will result in loss of semantic
> > information: the deserializer will not know how to re-create the
> > original object.
> >
>
> Yeah. I have a few problems with that if we were to couple some of the
> provider's code too much with serde:
>
> In a number of cases we do not care about restoring the original object
> - case in question: the common.sql "tuples" returned by some DBApiHooks.
> Common.sql and DBApiHook expect that we want to have a "serializable
> tuple" as an output (basically - with a few caveats), but once returned
> we do not care if the original object was a DatabricksNamedTuple or a
> SnowflakeDictionary (both are possible to be returned by the DBApi). We
> care that the DBApiHook returns a serializable tuple, and this is the
> "common" format we want to potentially serialize from (and back to). We
> never ever want to serialize it back to DatabricksNamedTuple. It's
> merely an implementation detail of the DBApi that we got it while
> reading from the DB.
>
>
It is a choice that needs to be made at the API level - here common.sql.
In this case the particular form of the tuple was, somewhat implicitly,
part of the spec.
The real question is: when you find something unserializable, do you then
provide a custom serializer for serde, or do you convert your object to
something that can be serialized by default? The first option makes it
available to all other places where this object needs to be serialized.
The latter limits that scope to where the conversion happens - here
common.sql.
So, to provide an example in the context of our discussion of PyODBC.Row
(see the discussion in the linked PR): only common.sql can now deal with
PyODBC.Row. If for some reason another provider returns a PyODBC.Row, that
object remains unserializable.
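For reference, a custom serializer for serde is just a module that follows
the serialize/deserialize contract. Roughly (the PyODBC.Row handling here
is a sketch, not the code from the PR):

    # Hypothetical module, e.g. airflow/serialization/serializers/odbc.py
    __version__ = 1

    serializers = ["pyodbc.Row"]  # qualified names this module handles
    deserializers = []            # one-way: a plain tuple is good enough

    def serialize(o):
        # Reduce the Row to primitives; serde stores these together with
        # the classname and version. True marks the object as handled.
        return list(o), "pyodbc.Row", __version__, True

    def deserialize(classname, version, data):
        raise NotImplementedError("pyodbc.Row is serialized one-way")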
> I fail to see what the benefits would be of coupling the code in the
> provider with serde in this case. Also I think (see security above) that
> IF we seriously think about security, then some mechanism of registering
> serializers from providers will be needed as soon as we solve the
> question of "how secure is limiting the package name for serialized
> objects" and whether we need it at all. For me this calls for an
> extension of our plugin mechanisms and "registering" serializable data
> by providers.
>
I have explained what the downside is of the approach you took, although
its impact in this case is probably low (not many other places will return
a PyODBC.Row, but consider a DataFrame hypothetically).
I think you and I have a different understanding of what "coupling" means
here. If you drop a custom serializer into serde, this does not suddenly
create a coupling with a provider. Serde then just knows how to handle a
particular object. This is transparent to the provider code.
The change - or misunderstanding - might be that in common.sql you rely on
the fact that something needs to be serializable, rather than on the user
using TaskFlow and passing a variable, which has a decoupling step during
execution. If that is the case, it becomes a choice, as explained above.
>
>
> > *Considerations / Questions:*
> >
> > · Do we want to move to a single serializer? What would that look
> > like?
> >
>
> I am not sure. Case in question above: whatever is returned by the
> DB API need not be deserialized or versioned. We need a tuple. That's
> it.
>
I think you needed a serializable tuple so that it could go into XCom.
That makes it a case for an explicit choice, imho.
>
> Also I think there are several cases where different serializations
> might make more sense. Take Pydantic for one. Their serialization is
> very different, based on different assumptions; in AIP-44, for example,
> we are using their ability to serialize connected ORM classes. They
> simply take a Row of a Task Instance, and when serialized it becomes a
> Pydantic object that is not "ORM" bound, but it has all the
> foreign-key-linked classes available (DagRun, and Dag) with all the
> "read-only" properties that we want to see. And it's basically one-way
> serialization (we have no intention of deserializing it back into an
> ORM object).
>
> So while "serialization" as a name is common - these two serve quite
> different
> purposes. There is no point in using serde for Internal API, have
> versioning and
> deserialization, and more importantly it's going to be way slower with JSON
> and
> Python code. Pydantic v2 uses Rust a lot and pretty much all serialization
> happens
> in Rust and is up to 16x faster than any Python code we could write for it.
> And for
> internal API we need as FAST as possible because basically we are going to
> replace multiple DB queries that are happening internally in our
> components.
>
>
On speed: afaik AIP-44 still makes use of the DAG serialization code,
apart from Pydantic's Rust implementation. The DAG serialization code is
seriously slow. So, before claiming speed improvements here, please show
the numbers :-).
> I think in this context, we do not need "one serialization to rule them
> all". I'd even argue that maybe we do not need it for DAG serializers.
> Maybe it's "good enough", and maybe it's OK to keep a separate
> DAG-focused serializer for the purpose of serializing DAGs? Again, in
> this case we do not care about deserializing it to the original form.
> When we serialize a DAG to the DB, what we really care about is seeing
> the serializable form of it, not the original one. Actually even more -
> we DO NOT want to see the original form. Having a serialized form that
> you cannot convert back to the original (DAG Python code) is a security
> property of DAG serialization.
>
>
We can of course keep both. However, if you have ever looked at the DAG
serialization code, you might want to reconsider ;-). Alternatively, you
could build on serde and provide the guarantees that you require there. It
can be done. I have been able to serialize a DAG with serde (hence knowing
the 1:10 ratio), but until now I haven't added it, because its schema
looks different.
BTW: there is no need for the XCom serializer / serde to deserialize into
the original object. By convention we like to, of course, but it is not
required.
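For completeness, the per-object convention looks roughly like this (a
sketch - the class is made up, and deserialize is only needed if you want
the original object back):

    class MyResult:
        def __init__(self, value: int):
            self.value = value

        def serialize(self) -> dict:
            # serde picks this up and stores the dict of primitives.
            return {"value": self.value}

        @staticmethod
        def deserialize(data: dict, version: int) -> "MyResult":
            # Optional: without it, the consumer gets the primitives.
            return MyResult(data["value"])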
>
> > · Do we want to move serializers to their respective providers,
> > effectively making “serde” (or any other serializer) into a Public API?
> >
>
> This is a good question; I think yes. Say deltalake and iceberg - I'd
> say it makes sense to move them to the "Databricks" and "Tabular"
> providers respectively, mainly because of dependencies. We do not want
> Airflow core to depend on the client libraries, and we've
>
For the record: the XCom serializer lazy-loads its dependencies, so
arguably it does not make core Airflow dependent on client libraries.
> already learned the benefits of moving the code out to providers. Mainly
> because if there is a bugfix or upgrade to a 3rd-party library, we will
> not be held back by having to release Airflow to fix and adapt to those.
> We have already had many cases like this; it has even happened for the
> k8s executor - where we could fix bugs without having to release a new
> Airflow version.
>
>
The con is that the API becomes strict. We cannot change it easily anymore.
> Maybe that's a bit of a different question for Pandas - as we have no
> separate Pandas provider, and we use and link to it already from Airflow
> core. But... maybe it's also a sign we need a pandas provider in this
> case? This question is open for me.
>
>
To me this depends on what we want to do with data awareness (separate
discussion). If you think in terms of:
ObjectStoragePath : Table : DataFrame
it is useful to have a DataFrame as part of core. This could be pandas or
maybe Polars, which is so much faster :-).
>
> > · Do we want encryption of values? Where should that take place?
> >
>
> I think we need to discuss the use case for it and how the "API" to do
> that would look. IMHO we should either encrypt everything, or each
> serializer should "hard-code" encrypting specific data before generic
> serialization starts and decrypt after deserialization ends. So rather
> than building encryption into serialization, we should provide simple
> primitives ("encrypt", "decrypt") that the serializer should use to
> encrypt its data, while serialization/deserialization should not care
> about whether the field is encrypted or not. Maybe this is what you also
> thought about - if so, I am good with it.
>
>
We already do as you say, also in the linked PR. The serializer does not
deal with encrypted values directly. It is up to the object to determine
what needs to be encrypted and what does not.
> > · What do we define as a best practice for interacting with the
> > serializers? Are we okay with losing semantic information if using an
> > intermediate format? Or do we find it the best practice to provide a
> > serializer?
> >
>
> I think it should be case-by-case. And we should clearly define where
> and in what context the XCom serializer should be used, and where other
> serializers might and should be used. I see no particular value in
> applying the same rules and assumptions to the different serialization
> cases we might have.
>
>
Case-by-case is fine with me, as long as the choice is explicit and the
options are thus considered.
>
> > *Other links:*
> >
> > · Docs on serde / Xcom serializer:
> > https://github.com/apache/airflow/pull/35885
> >
> > · Encryption: https://github.com/apache/airflow/pull/35867
> >
> > · Discussion on PyODBC:
> > https://github.com/apache/airflow/pull/32319
> >
> >
> >
--
Bolke de Bruin
[email protected]