[ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629809#action_12629809 ]

Tom White commented on HADOOP-3788:
-----------------------------------

Looks like you're on the right track with PBDeserializer and PBSerializer. I 
think one of the problems lies in how the sequence file format interacts with 
protocol buffers.

SequenceFile.Reader#next(Object) reads the next key and value into a single 
DataInputBuffer, which is handed to PBDeserializer to deserialize from in the 
mergeFrom call. As you point out, PB reads to the end of the stream, so when it 
tries to read the key it consumes the whole buffer, including the value, which 
causes the exception. This isn't a problem for the other serialization 
frameworks we have seen so far (Writable, Java Serialization, Thrift), which 
know how much of the stream to consume.
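To illustrate (a rough sketch; MyKey and MyValue stand in for generated 
message types, they're not from the patch):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    // Key and value bytes sit back-to-back in one buffer, which is roughly
    // what the SequenceFile reader hands to the deserializer.
    static void demo(MyKey key, MyValue value) throws Exception {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      key.writeTo(out);
      value.writeTo(out);

      // mergeFrom(InputStream) has no framing: it reads to EOF, so the
      // value's bytes are consumed as part of the key (or break the parse).
      MyKey.newBuilder().mergeFrom(new ByteArrayInputStream(out.toByteArray()));
    }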

We could fix this by keeping separate buffers for the key and the value, much 
like org.apache.hadoop.mapred.IFile does. Or perhaps we could change the 
deserializer to take a length. The latter would only work if it is possible to 
restrict the number of bytes read from the stream in PB. Is this the case?
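Skimming the PB API, CodedInputStream looks like it can cap the number of 
bytes a parse is allowed to consume, which would make the length-based 
approach feasible. A rough sketch (MyKey is a placeholder generated type; 
recordLength would come from the record's stored length):

    import com.google.protobuf.CodedInputStream;
    import java.io.InputStream;

    static MyKey readOne(InputStream in, int recordLength) throws Exception {
      CodedInputStream cis = CodedInputStream.newInstance(in);
      int oldLimit = cis.pushLimit(recordLength); // expose only this record
      MyKey.Builder builder = MyKey.newBuilder();
      builder.mergeFrom(cis);                     // stops at the limit, not EOF
      cis.popLimit(oldLimit);                     // restore for the next record
      return builder.build();
    }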

Does it work if you don't try to read from a sequence file? Is your use-case 
based on being able to read from sequence files?

A couple of other points:
* PBDeserializerTracker reads from the stream twice, which isn't going to work. 
You need to tee the stream to do debugging.
* Can we keep a Builder instance per deserializer rather than create a new one 
for each call to deserialize? We should call clear() on it each time to reset 
its state (rough sketch below).
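For the second point, here's roughly what I mean, using the Deserializer 
interface from org.apache.hadoop.io.serializer (the class shape is 
illustrative, not what's in the patch):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.io.serializer.Deserializer;
    import com.google.protobuf.Message;

    public class PBDeserializer<T extends Message> implements Deserializer<T> {
      private final Message.Builder builder; // one Builder, reused per record
      private InputStream in;

      public PBDeserializer(Message.Builder builder) {
        this.builder = builder;
      }

      public void open(InputStream in) {
        this.in = in;
      }

      @SuppressWarnings("unchecked")
      public T deserialize(T t) throws IOException {
        builder.clear();       // reset state rather than allocating a new Builder
        builder.mergeFrom(in); // still reads to EOF; see the framing issue above
        return (T) builder.build();
      }

      public void close() throws IOException {
        if (in != null) {
          in.close();
        }
      }
    }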



> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3788-v1.patch, protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding 
> data in a compact binary format. This issue is to write a 
> ProtocolBuffersSerialization to support using Protocol Buffers types in 
> MapReduce programs, including an example program. This should probably go 
> into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
