Jakob Homan created SAMZA-198:
---------------------------------

             Summary: Provide SystemStreamPartition info to SerDe fromBytes/toBytes methods
                 Key: SAMZA-198
                 URL: https://issues.apache.org/jira/browse/SAMZA-198
             Project: Samza
          Issue Type: Bug
            Reporter: Jakob Homan


Right now the Deserializer fromBytes method takes just a byte array, meaning 
that it doesn't know anything about where those bytes came from.

We have a use case with Avro messages coming from Kafka where we may receive 
several different versions of the same schema (each version coming from a 
different stream-partition).  This works okay.  However, in the same stream 
task we're actually consuming more than one type of Avro message, and each of 
those types has the same versioning situation.

Once we're in the process method we can take the generic record and poke it for 
its internal structure to see what type and version it is.  At this point we 
can re-encode it if necessary to bring its schema version up to the latest 
before sending it on.  However, this extra work is expensive and is dominating 
the time spent in the process method.

If we knew at deserialization time which SSP (SystemStreamPartition) the 
message came from, we could provide the Avro GenericDatumReader with the reader 
schema, saving the expensive re-encode step in the process method.
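
As a rough sketch of the idea (the Ssp class and SspAwareDeserializer 
interface below are hypothetical stand-ins, not Samza's actual API, and the 
"schema" is just a name rather than a real Avro Schema), the deserializer 
could pick the schema from the SSP instead of inspecting the record afterwards:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a system/stream/partition triple.
final class Ssp {
    final String system;
    final String stream;
    final int partition;

    Ssp(String system, String stream, int partition) {
        this.system = system;
        this.stream = stream;
        this.partition = partition;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Ssp)) return false;
        Ssp other = (Ssp) o;
        return system.equals(other.system)
            && stream.equals(other.stream)
            && partition == other.partition;
    }

    @Override
    public int hashCode() {
        return Objects.hash(system, stream, partition);
    }
}

// Hypothetical variant of the deserializer that is told where the bytes came from.
interface SspAwareDeserializer<T> {
    T fromBytes(Ssp source, byte[] bytes);
}

// Looks up the schema registered for the source SSP before decoding.
final class SchemaRoutingDeserializer implements SspAwareDeserializer<String> {
    private final Map<Ssp, String> schemaBySsp = new HashMap<>();

    void register(Ssp ssp, String schemaName) {
        schemaBySsp.put(ssp, schemaName);
    }

    @Override
    public String fromBytes(Ssp source, byte[] bytes) {
        String schema = schemaBySsp.get(source);
        if (schema == null) {
            throw new IllegalArgumentException(
                "No schema registered for " + source.stream + "/" + source.partition);
        }
        // A real implementation would configure an Avro datum reader with the
        // schema here; this sketch only tags the payload with the schema name.
        return schema + ":" + new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With something like this, the choice of schema happens once per message at 
deserialization time, so the process method would not need to re-encode.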

I imagine other systems could benefit from this extra info as well.  The 
information is available in the IncomingMessageEnvelope when we call the 
deserializer; it's just not being passed in.

(A parallel argument applies to the toBytes method in the Serializer interface.)
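
For the serializer side, a similarly hedged sketch: StreamAwareSerializer is 
a hypothetical interface, the destination is reduced to a stream name, and the 
one-byte version prefix is purely illustrative, not a real Avro or Samza wire 
format.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of the serializer that is told where the bytes are going.
interface StreamAwareSerializer<T> {
    byte[] toBytes(String destinationStream, T message);
}

// Chooses the output version from the destination stream.
final class VersionPrefixingSerializer implements StreamAwareSerializer<String> {
    private final Map<String, Integer> versionByStream = new HashMap<>();

    void register(String stream, int version) {
        versionByStream.put(stream, version);
    }

    @Override
    public byte[] toBytes(String destinationStream, String message) {
        Integer version = versionByStream.get(destinationStream);
        if (version == null) {
            throw new IllegalArgumentException(
                "No version registered for " + destinationStream);
        }
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        // Illustrative layout: one version byte followed by the UTF-8 payload.
        byte[] out = new byte[payload.length + 1];
        out[0] = version.byteValue();
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }
}
```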



--
This message was sent by Atlassian JIRA
(v6.2#6252)
