potiuk commented on PR #36801:
URL: https://github.com/apache/airflow/pull/36801#issuecomment-1893825121

   > What if we encrypt entire payload? By the same way as we done in 
Connection Extra? By doing that it required to convert existed type field in DB 
from JSON (text depends on DB) to binary type. And this could be done during 
migration with convertion/encryption existed payload.
   
   Qite Agree. I also thought the same - see my comment 
https://github.com/apache/airflow/pull/36801#discussion_r1452796195 and our 
earlier discusion with @jscheffl  
https://github.com/apache/airflow/pull/36492#issuecomment-1884389895 
   
   First of all -  I think indeed there is no particular reason to use JSON 
type here - we are using ExtendedJSON type which implements airflow 
serialization, but that is on its own  weird - our serialization serializes 
more than JSON - AND it uses  much proprietary JSON that is generally not 
suitable to using as JSON source of data - the way it is implemented (and the 
fact it uses JSON) is merely an implementation detail - so using JSON field to 
store it, makes very little sense, because (as @jscheffl mentioned in his 
comments - it anyhow requires the data to be deserialized using our Python code 
to be "useful" so we generaly lose any DB-level support to read and query that 
json data. So it makes very little sense to store it in JSON field (and we are 
not using anywhere the fact that it is JSON field). IMHO we should convert it 
to a simple TEXT or even BLOB field.
   
   Secondly the security property/ performance - let me repeat from what I 
wrote above - I have no hard data, but intuitively it's much more secure and 
faster to encrypt the  whole blob rather than multiple individual fields in it. 
It should takes a lot more memory and CPU (and enthropy which is also important 
performance factor for encryption) to convert and encrypt multiple individual 
(sometimes short) strings in a data structure, than it is to encrypt the whole 
string representation in a single operation. Especially if you consider the 
usual padding for short strings, and include the fact that encrypting multiple 
short string generally decreases "security" properties of these (think 
dictionary/rainbow attacks  - it makes much more sense to encrypt the whole 
blob - especially if you have no need/possibility to read the remaining 
unencrypted fields for all your data. 
   
   I made the same comments in https://github.com/apache/airflow/pull/35867 by 
@bolkedebruin - who wants to add encryption for serde primitives, which I think 
is generally bad idea and we should instead encrypt the whole serialized blob. 
Its faster, more secure,and we could even add a nice tool that will decrypt it 
for diagnostic purposes if needed.
   
   And most of all - it is far more future-proof because you do not have to 
rely on individuals who write serialziation to mark all the fields that require 
to be serialized (the last point is somewhat mitigated by  @hussein-awala here 
by serializing all strings, but I think it solves just a subset of it). 
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to