Hi Ryan,
Thanks for getting back to me.
Yes, the change is for the Java library. As you mention, in other languages it
doesn't seem easy to package it as a library that can delegate like we do in
the JVM. It is, however, feasible to deserialize the data in another language,
given access to the same encryption libraries, as the structure of the
serialized object is known to the developer.
I modified GenericDatumWriter/Reader because that is where I found the main
entry methods:
```
public D read(D reuse, Decoder in) throws IOException
```
And
```
public void write(D datum, Encoder out) throws IOException
```
I also have a generalized template that is used for all our "tenanted"
schemas. The generated classes extend an abstract class and delegate
beforeDeserialization and afterSerialization to it, so the code is centralized.
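To make the idea concrete, here is a minimal sketch of how such a base class
could centralize the hooks. It is not the actual change: it uses plain byte
arrays instead of Avro's Encoder/Decoder so that it runs with the JDK alone,
and the names ByteHooks, TenantKeys, TenantedRecord and HookedCodec are made up
for illustration. AES-GCM with an in-memory key per tenant is also just an
assumption, as is the record already knowing its tenant at decode time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical, byte[]-based stand-in for SerializeFinalizationDelegate.
interface ByteHooks {
    byte[] afterSerialization(byte[] serialized);
    byte[] beforeDeserialization(byte[] stored);
}

// Hypothetical key lookup: one in-memory AES key per tenant, demo only.
final class TenantKeys {
    private static final Map<String, SecretKey> KEYS = new ConcurrentHashMap<>();

    static SecretKey keyFor(String tenantId) {
        return KEYS.computeIfAbsent(tenantId, id -> {
            try {
                KeyGenerator generator = KeyGenerator.getInstance("AES");
                generator.init(128);
                return generator.generateKey();
            } catch (GeneralSecurityException e) {
                throw new IllegalStateException(e);
            }
        });
    }
}

// Base class for "tenanted" records: the crypto lives here, in one place.
abstract class TenantedRecord implements ByteHooks {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;
    private static final SecureRandom RANDOM = new SecureRandom();

    // Assumption: the instance already knows its tenant, also when decoding.
    protected abstract String tenantId();

    @Override
    public byte[] afterSerialization(byte[] serialized) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, TenantKeys.keyFor(tenantId()),
                    new GCMParameterSpec(TAG_BITS, iv));
            byte[] encrypted = cipher.doFinal(serialized);
            // Stored layout: IV followed by ciphertext and GCM tag.
            return ByteBuffer.allocate(iv.length + encrypted.length)
                    .put(iv).put(encrypted).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    @Override
    public byte[] beforeDeserialization(byte[] stored) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, TenantKeys.keyFor(tenantId()),
                    new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
            return cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}

// Stand-in for the writer/reader side: delegate only if the datum opts in.
final class HookedCodec {
    static byte[] finishWrite(Object datum, byte[] serialized) {
        return datum instanceof ByteHooks
                ? ((ByteHooks) datum).afterSerialization(serialized) : serialized;
    }

    static byte[] prepareRead(Object reuse, byte[] stored) {
        return reuse instanceof ByteHooks
                ? ((ByteHooks) reuse).beforeDeserialization(stored) : stored;
    }
}

class HookDemo {
    public static void main(String[] args) {
        TenantedRecord record = new TenantedRecord() {
            @Override protected String tenantId() { return "tenant-a"; }
        };
        byte[] plain = "avro-bytes".getBytes(StandardCharsets.UTF_8);
        byte[] stored = HookedCodec.finishWrite(record, plain);
        byte[] back = HookedCodec.prepareRead(record, stored);
        System.out.println(Arrays.equals(plain, back));
    }
}
```

The point of the design is that records which do not implement the hooks pass
through untouched, so the standard serialization path stays unchanged for them.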
About customEncode, I didn't go down that route. To tell you the truth, I
didn't find much documentation.
I did, however, try a couple of other extension points, one of which was
logical types. As you can see in the signature
```
public ByteBuffer fromBytes(ByteBuffer value, Schema schema, LogicalType type)
```
there we don't have access to the original object where we would have the
tenant information needed to retrieve the right token to use to encrypt the
data.
Would it make sense for me to open a branch to show some code?
Best,
-Enrico
On 6/17/20, 4:39 PM, "Ryan Skraba" <[email protected]> wrote:
Hi! I was interested enough to watch the entire video from Flink Forward.
I do think this is a good proposal, and adding hooks to "customize"
the serialized bytes is a pretty neat idea. The developer can benefit
from learning or using Avro-generated classes and the SDK, and still
using standard serialization underneath the customized logic.
At first glance, this would stay in the Java SDK, right? I mean, once
you've customized your Avro specific record with its own
serialization layer, there's little hope (without extensive work) for
a different language to expect to be able to read it. In other words,
you'd never be able to write it to an Avro file and never expect it to
be readable via another programming language or using a generic
model... which is kind of the point!
Is there any use to having these changes in the
GenericDatumWriter/Reader as opposed to the
SpecificDatumWriter/Reader? Would there ever be an instance where a
generic model of data would delegate serialization?
Do you think that the necessary changes you've made to the specific
data templates could be generalized? I believe I've already come
across a situation where we've customized the "extends
MySpecificRecordBase" part of the templates -- it could be a
configuration option. I'm not sure whether passing along the record
context (tenant id) to nested elements is generalizable, but I haven't
thought very hard about it yet.
Have you looked into the `customEncode` parts of generated specific
records? This or something similar might be a more flexible technique
than the SerializeFinalizationDelegate interface methods.
Thanks for sharing! Ryan
On Tue, Jun 16, 2020 at 3:02 PM Enrico Agnoli
<[email protected]> wrote:
>
> Hi,
>
> I would like to propose a change to AVRO to allow services to
integrate some logic after serialization and before deserialization.
> We use AVRO to support the data serialization in our streaming
infrastructure and we decided to extend it to provide us the possibility to
encrypt the data with info available directly on the data itself: the owner of
it.
> The change-set is pretty small and I would like to hear from you if it
makes sense to contribute it back to the project.
>
> == The problem is:
> Multi-tenant applications need to encrypt data (with the keys
of the owner/tenant that generated that piece of data) every time it is
serialized, to avoid commingling different tenants' data. To do so
transparently to the application, the ideal place to implement the encryption
is the serialization library (AVRO).
>
> == Proposal:
> We modified the AVRO code to have afterSerialization and
beforeDeserialization hooks that can use object defined values (the
tenant/owner of that data) to implement encryption.
> In the code we propose to submit we implemented a new interface:
`SerializeFinalizationDelegate.java`
> ```
> public interface SerializeFinalizationDelegate {
> void afterSerialization(ByteArrayOutputStream serializedData, Encoder finalEncoder);
> Decoder beforeDeserialization(Decoder dataToDecode);
> }
> ```
> This needs to be implemented by any AVRO-serializable class that wants to
define post-serialization or pre-deserialization logic.
> `GenericDatumWriter` and `GenericDatumReader` are modified to delegate to
the object implementation of the methods above.
>
> More info can be found at
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.slideshare.net_FlinkForward_multi-2Dtenanted-2Dstreams-2Dworkday-2Denrico-2Dagnoli-2Dleire-2Dfernandez-2Dde-2Dretana-2Droitegui-2Dworkday-2D185815223&d=DwIFaQ&c=DS6PUFBBr_KiLo7Sjt3ljp5jaW5k2i9ijVXllEdOozc&r=5oal4CtBGP1ioAe2G2rMT-XLCpWwh5R4aEw1TqtlCnc&m=Xu7g3Tz4gpvKrNVQaH8E_gOocZRRxOjiYDGo8Y44Peg&s=dea8kpG8JMBbu6GIqT176VBrvrIrnXdoMByO2cD9SS4&e=
from slide 21
>
>
> What do you think about this proposal? I wanted to first start a
discussion, but if it helps I can create a patch or a branch to show the change.
>
> Hope to hear from you,
> -Enrico