[ 
https://issues.apache.org/jira/browse/SPARK-6986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500439#comment-14500439
 ] 

Yin Huai commented on SPARK-6986:
---------------------------------

It seems the information we need is whether an object is a key/value pair, a 
key, or a value. With this information, the underlying serializer knows how 
to serialize an object (a specialized serializer understands the structure of 
a key and of a value). I am proposing the following interface changes.

For {{SerializationStream}}, we add
{code}
def writeKey[T: ClassTag](key: T): SerializationStream
def writeValue[T: ClassTag](value: T): SerializationStream
{code}
When we want to serialize a key/value pair, we still use the general-purpose 
{{writeObject}} method.
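To make the idea concrete, here is a minimal sketch of the write side, assuming a simplified stand-in for Spark's {{SerializationStream}} (the {{JavaSerializationStream}} below is illustrative, not Spark's actual class). The proposed methods delegate to {{writeObject}} by default, so a general-purpose serializer needs no changes, while a specialized one can override them:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.reflect.ClassTag

// Simplified stand-in for Spark's SerializationStream (not the real trait).
abstract class SerializationStream {
  def writeObject[T: ClassTag](t: T): SerializationStream

  // Proposed additions: by default they delegate to writeObject, so a
  // general-purpose serializer gets them for free; a specialized serializer
  // can override them to exploit the key/value structure.
  def writeKey[T: ClassTag](key: T): SerializationStream = writeObject(key)
  def writeValue[T: ClassTag](value: T): SerializationStream = writeObject(value)

  def flush(): Unit
  def close(): Unit
}

// A general-purpose stream built on plain Java serialization.
class JavaSerializationStream(out: ObjectOutputStream) extends SerializationStream {
  override def writeObject[T: ClassTag](t: T): SerializationStream = {
    out.writeObject(t)
    this
  }
  override def flush(): Unit = out.flush()
  override def close(): Unit = out.close()
}

// Usage: writeKey/writeValue chain in the same fluent style as writeObject.
val buf = new ByteArrayOutputStream()
val stream = new JavaSerializationStream(new ObjectOutputStream(buf))
stream.writeKey("someKey").writeValue(42)
stream.flush()
```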

For {{DeserializationStream}}, we add
{code}
def readKey[T: ClassTag](): T
def readValue[T: ClassTag](): T
{code}
When we want to deserialize a key/value pair, we still use the general-purpose 
{{readObject}} method.
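The read side mirrors this. A hedged sketch, again with a simplified stand-in for Spark's {{DeserializationStream}}, round-tripping a key and a value written with plain Java serialization:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.reflect.ClassTag

// Simplified stand-in for Spark's DeserializationStream (not the real trait).
abstract class DeserializationStream {
  def readObject[T: ClassTag](): T

  // Proposed additions: delegate to readObject by default; a specialized
  // deserializer can override them because it knows the key/value layout.
  def readKey[T: ClassTag](): T = readObject[T]()
  def readValue[T: ClassTag](): T = readObject[T]()

  def close(): Unit
}

class JavaDeserializationStream(in: ObjectInputStream) extends DeserializationStream {
  override def readObject[T: ClassTag](): T = in.readObject().asInstanceOf[T]
  override def close(): Unit = in.close()
}

// Round trip: write a key and a value with plain Java serialization,
// then read them back through the proposed methods.
val buf = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(buf)
oos.writeObject("someKey")
oos.writeObject(Int.box(42))
oos.flush()

val ds = new JavaDeserializationStream(
  new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)))
val k = ds.readKey[String]()
val v = ds.readValue[Integer]()
```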

For shuffle, since we do not interact with a {{SerializationStream}} directly 
but instead use a {{BlockObjectWriter}} to write data out, we also need to add 
{{writeKey}} and {{writeValue}} to {{BlockObjectWriter}}. In 
{{ExternalSorter}}, {{spillToMergeableFile}} will use these two methods to 
write data out. Everywhere else that we write key/value pairs, we still use 
the general-purpose {{write}} method of a {{BlockObjectWriter}}. 
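A hypothetical, much-simplified sketch of the {{BlockObjectWriter}} change (the real class also tracks write metrics and supports commit/revert; here it just forwards to an underlying stream):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.reflect.ClassTag

// Hypothetical, simplified BlockObjectWriter: the real one wraps a
// SerializationStream and tracks metrics. It just forwards writes here.
class BlockObjectWriter(out: ObjectOutputStream) {
  // Existing general-purpose entry point: a whole key/value pair.
  def write(kv: (Any, Any)): Unit = out.writeObject(kv)

  // Proposed additions, to be used by ExternalSorter.spillToMergeableFile
  // so the underlying serializer sees keys and values separately.
  def writeKey[T: ClassTag](key: T): Unit = out.writeObject(key)
  def writeValue[T: ClassTag](value: T): Unit = out.writeObject(value)

  def flush(): Unit = out.flush()
}

// Usage: a spill loop would call writeKey/writeValue per record.
val buf = new ByteArrayOutputStream()
val writer = new BlockObjectWriter(new ObjectOutputStream(buf))
writer.writeKey("someKey")
writer.writeValue(42)
writer.flush()
```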

[~matei] [~pwendell] [~rxin] [~marmbrus] What do you think?

> Make SerializationStream/DeserializationStream understand key/value semantic
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-6986
>                 URL: https://issues.apache.org/jira/browse/SPARK-6986
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>            Reporter: Yin Huai
>            Priority: Blocker
>
> Our existing Java and Kryo serializers are both general purpose. They treat 
> every object individually and encode the type of each object into the 
> underlying stream. In Spark, it is common to serialize a collection of 
> records that all have the same type (for example, the records of a 
> DataFrame). In these cases, we do not need to write out the type of every 
> record, and we can take advantage of the type information to build a 
> specialized serializer. To do so, it seems we need to extend the interface 
> of SerializationStream/DeserializationStream, so that a 
> SerializationStream/DeserializationStream can have more information about 
> the objects passed in (for example, whether an object is a key/value pair, 
> a key, or a value). 
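As an illustration of the payoff described above, here is a sketch of a serializer specialized for (String, Int) records; all class names are hypothetical. Because the types are known up front, no per-record type tag is written, unlike the general-purpose Java/Kryo streams:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Illustrative only: a stream specialized for String keys and Int values.
// Knowing the types up front, it writes raw UTF/int bytes with no type tags.
class StringIntWriteStream(out: DataOutputStream) {
  def writeKey(key: String): this.type = { out.writeUTF(key); this }
  def writeValue(value: Int): this.type = { out.writeInt(value); this }
  def flush(): Unit = out.flush()
}

class StringIntReadStream(in: DataInputStream) {
  def readKey(): String = in.readUTF()
  def readValue(): Int = in.readInt()
}

// Round trip two records: each costs writeUTF (2-byte length + bytes)
// plus a 4-byte int, and nothing else.
val buf = new ByteArrayOutputStream()
val w = new StringIntWriteStream(new DataOutputStream(buf))
w.writeKey("a").writeValue(1)
w.writeKey("b").writeValue(2)
w.flush()

val r = new StringIntReadStream(
  new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
val rows = Seq((r.readKey(), r.readValue()), (r.readKey(), r.readValue()))
```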



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
