Hi, I'm working on a project where I plan to put clickstream data into Kafka serialized using AVRO. In a later stage I want these records persisted into AVRO files so they can be used by people using PIG.
So far this is no problem at all. Now some of those fields (not all) are privacy sensitive, so I do not want them to be plain text in the data. I want them encrypted so that they can only be read by the people who need access to these fields.

The only thing I have found so far about encrypting data in AVRO is https://issues.apache.org/jira/browse/AVRO-1371, which states:

> Similar to compression and decompression, encryption and decryption
> can be implemented with Codecs, a concept that already exists in Avro.

I had a look at that Codecs API and it simply takes the 'entire thing' as a ByteBuffer and compresses it. So this means the entire record is encrypted (which is not what I want). I want, without storing the data twice (it is too big for that):
- All consumers to be able to read 'most' fields.
- Some consumers to be able to read 'all' fields.

I was contemplating simply putting the key id and the encrypted bytes into a field of type 'bytes'. That way there is no need to change the underlying file format. To keep it simple I would have the application code generate the 'encrypted value' and store it in the record. Then at the PIG side I would create a UDF that does the decryption again.

To make using this easier I even thought about extending the IDL language (keyword 'encrypted') and then generating extra/different utility methods that wrap/encrypt that field via the setters/builders and put it in a normal AVRO file as bytes.

But before I start coding: Has anyone ever thought about what the 'right' approach is to do this in AVRO? Has anyone built something I can have a look at?

--
Best regards

Niels Basjes
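P.S. To make the idea concrete, here is a minimal sketch of the wrapping I have in mind: the value stored in the 'bytes' field is [4-byte key id][IV][ciphertext], so a reader can look up the right key by id and decrypt. This is only an illustration, assuming plain JDK AES/CBC; the class and method names (`FieldCrypto`, `encryptField`, `decryptField`) are mine, not anything that exists in Avro today.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.nio.ByteBuffer;
import java.security.SecureRandom;

public class FieldCrypto {
    // Layout of the wrapped value: [4-byte keyId][16-byte IV][ciphertext].
    // The resulting byte[] is what would go into the Avro 'bytes' field.
    public static byte[] encryptField(int keyId, SecretKey key, byte[] plain) throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);               // fresh IV per field value
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(plain);
        return ByteBuffer.allocate(4 + 16 + ct.length)
                .putInt(keyId).put(iv).put(ct).array();
    }

    // The PIG UDF would do the reverse: read the key id, fetch the matching
    // key (here the caller just passes it in), then decrypt the rest.
    public static byte[] decryptField(SecretKey key, byte[] wrapped) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(wrapped);
        int keyId = buf.getInt();                       // in practice: look up key by this id
        byte[] iv = new byte[16];
        buf.get(iv);
        byte[] ct = new byte[buf.remaining()];
        buf.get(ct);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return c.doFinal(ct);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        byte[] wrapped = encryptField(42, key, "user@example.com".getBytes("UTF-8"));
        System.out.println(new String(decryptField(key, wrapped), "UTF-8"));
    }
}
```

Consumers without the key for a given key id simply skip the field; everyone else reads the record normally, so the file format itself stays untouched.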
