potiuk commented on PR #36801: URL: https://github.com/apache/airflow/pull/36801#issuecomment-1893825121
> What if we encrypt entire payload? By the same way as we done in Connection Extra? By doing that it required to convert existed type field in DB from JSON (text depends on DB) to binary type. And this could be done during migration with convertion/encryption existed payload. Qite Agree. I also thought the same - see my comment https://github.com/apache/airflow/pull/36801#discussion_r1452796195 and our earlier discusion with @jscheffl https://github.com/apache/airflow/pull/36492#issuecomment-1884389895 First of all - I think indeed there is no particular reason to use JSON type here - we are using ExtendedJSON type which implements airflow serialization, but that is on its own weird - our serialization serializes more than JSON - AND it uses much proprietary JSON that is generally not suitable to using as JSON source of data - the way it is implemented (and the fact it uses JSON) is merely an implementation detail - so using JSON field to store it, makes very little sense, because (as @jscheffl mentioned in his comments - it anyhow requires the data to be deserialized using our Python code to be "useful" so we generaly lose any DB-level support to read and query that json data. So it makes very little sense to store it in JSON field (and we are not using anywhere the fact that it is JSON field). IMHO we should convert it to a simple TEXT or even BLOB field. Secondly the security property/ performance - let me repeat from what I wrote above - I have no hard data, but intuitively it's much more secure and faster to encrypt the whole blob rather than multiple individual fields in it. It should takes a lot more memory and CPU (and enthropy which is also important performance factor for encryption) to convert and encrypt multiple individual (sometimes short) strings in a data structure, than it is to encrypt the whole string representation in a single operation. Especially if you consider the usual padding for short strings, and include the fact that encrypting multiple short string generally decreases "security" properties of these (think dictionary/rainbow attacks - it makes much more sense to encrypt the whole blob - especially if you have no need/possibility to read the remaining unencrypted fields for all your data. I made the same comments in https://github.com/apache/airflow/pull/35867 by @bolkedebruin - who wants to add encryption for serde primitives, which I think is generally bad idea and we should instead encrypt the whole serialized blob. Its faster, more secure,and we could even add a nice tool that will decrypt it for diagnostic purposes if needed. And most of all - it is far more future-proof because you do not have to rely on individuals who write serialziation to mark all the fields that require to be serialized (the last point is somewhat mitigated by @hussein-awala here by serializing all strings, but I think it solves just a subset of it). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
