[jira] [Commented] (SAMZA-198) Provide SystemStreamPartition info to SerDe fromBytes/toBytes methods

Jakob Homan (JIRA) Thu, 20 Mar 2014 15:52:27 -0700

    [ https://issues.apache.org/jira/browse/SAMZA-198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942453#comment-13942453 ]


Jakob Homan commented on SAMZA-198:
-----------------------------------

bq. 1. What do you think about just giving the serdes IncomingMessageEnvelope 
with byte arrays for both the key and value? This is a superset of the 
information you need.
What would be the return type? Since the interface is in Java and we can't 
return tuples, we'd need some type of collection or a new type.  It would make 
the SerdeManager code a bit easier, though.
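
To make the "new type" option concrete, here is a rough sketch of what such a 
return type might look like. KeyAndMessage and EnvelopeDeserializer are 
made-up names for illustration, not existing Samza classes, and the origin is 
passed as plain system/stream/partition values only to keep the sketch 
self-contained.

```java
// Hypothetical holder for a deserialized key/message pair (not a Samza class).
final class KeyAndMessage<K, M> {
  private final K key;
  private final M message;

  KeyAndMessage(K key, M message) {
    this.key = key;
    this.message = message;
  }

  K getKey() { return key; }

  M getMessage() { return message; }
}

// Hypothetical envelope-style deserializer: it receives the raw key and
// message bytes together with where they came from, and returns both
// deserialized values in one object instead of a tuple.
interface EnvelopeDeserializer<K, M> {
  KeyAndMessage<K, M> fromEnvelope(String system, String stream, int partition,
                                   byte[] keyBytes, byte[] messageBytes);
}
```

The cost of this shape is one extra allocation per message for the holder.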

bq. This is somewhat specific to serde'ing messages. The nice thing about 
toBytes and fromBytes right now is that it's a serde that can be used for 
everything (e.g. leveldb serialization, etc) including cases where the bytes 
don't have a SystemStreamPartition associated with them.
We could have two methods for the serde to implement (for both to/fromBytes): 
one with SSP and one without.  It would be easy for the Serde to forward a call 
from the latter to the former when useful.
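
A minimal sketch of the two-method idea, under some assumptions: Ssp is a 
stand-in for SystemStreamPartition so the example compiles on its own, and 
SspAwareSerde and PerStreamStringSerde are hypothetical names, not the actual 
Samza interfaces. The per-stream charset lookup stands in for choosing a 
per-stream Avro reader schema.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Stand-in for SystemStreamPartition (assumption, for self-containment).
final class Ssp {
  final String system;
  final String stream;
  final int partition;

  Ssp(String system, String stream, int partition) {
    this.system = system;
    this.stream = stream;
    this.partition = partition;
  }
}

// Hypothetical interface: each direction has a plain and an SSP-aware method.
interface SspAwareSerde<T> {
  T fromBytes(byte[] bytes);
  T fromBytes(byte[] bytes, Ssp ssp);
  byte[] toBytes(T obj);
  byte[] toBytes(T obj, Ssp ssp);
}

// Example serde that picks a charset per stream. The plain methods forward
// to the SSP-aware ones with a null SSP, which falls back to UTF-8.
final class PerStreamStringSerde implements SspAwareSerde<String> {
  private final Map<String, Charset> charsetByStream = new HashMap<String, Charset>();

  PerStreamStringSerde register(String stream, Charset charset) {
    charsetByStream.put(stream, charset);
    return this;
  }

  private Charset charsetFor(Ssp ssp) {
    Charset c = (ssp == null) ? null : charsetByStream.get(ssp.stream);
    return (c == null) ? StandardCharsets.UTF_8 : c;
  }

  public String fromBytes(byte[] bytes) { return fromBytes(bytes, null); }

  public String fromBytes(byte[] bytes, Ssp ssp) {
    return new String(bytes, charsetFor(ssp));
  }

  public byte[] toBytes(String obj) { return toBytes(obj, null); }

  public byte[] toBytes(String obj, Ssp ssp) {
    return obj.getBytes(charsetFor(ssp));
  }
}
```

Callers with no SSP (a key-value store, say) keep using the plain methods, 
while the message path can pass the SSP through.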


> Provide SystemStreamPartition info to SerDe fromBytes/toBytes methods
> ---------------------------------------------------------------------
>
>                 Key: SAMZA-198
>                 URL: https://issues.apache.org/jira/browse/SAMZA-198
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jakob Homan
>
> Right now the Deserializer fromBytes method takes just a byte array, meaning 
> that it doesn't know anything about where those bytes came from.
> We have a use case with Avro messages coming from Kafka where we may be 
> getting several different versions of the same schema (each different version 
> coming from a different stream-partition).  This works okay.  However, in the 
> same stream task, we're actually consuming from more than one type of Avro 
> message and each of those types has that same situation.
> Once we're in the process method we can take the generic record and poke it 
> for its internal structure to see what type and version it is.  At this point 
> we can re-encode it if necessary to bring its schema version up to the latest 
> before sending it on.  However, this extra work is expensive and is 
> dominating the time spent in the process method.
> However, if at deserialization time we knew what SSP the message came from, 
> we could provide the Avro GenericDatumReader the reader schema, thus saving 
> the expensive re-encode step in the process method.
> I imagine other systems could benefit from this extra info as well.  The 
> information is available in the IncomingMessageEnvelope when we call the 
> deserializer, it's just not being passed in.
> (A parallel argument applies to the toBytes method in the Serializer 
> interface)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
