[
https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629809#action_12629809
]
Tom White commented on HADOOP-3788:
-----------------------------------
Looks like you're on the right track with PBDeserializer and PBSerializer. I
think one of the problems is with the sequence file format and how it interacts
with protocol buffers.
SequenceFile.Reader#next(Object) reads the next key and value into a single
DataInputBuffer, which is given to PBDeserializer to deserialize from in
the mergeFrom call. As you point out, PB reads to the end of the stream, so
when it tries to read the key it consumes the whole buffer, consuming the value
as well, which causes the exception. This is not a problem with other
serialization frameworks that we have seen so far (Writable, Java
Serialization, Thrift), which know how much of the stream to consume.
We could fix this by having separate buffers for key and value, much like
org.apache.hadoop.mapred.IFile does. Or perhaps we could change the deserializer to
take a length. The latter would only work if it is possible to restrict the
number of bytes read from the stream in PB. Is this the case?
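
To make the "take a length" option concrete, here is a minimal sketch that uses only the JDK and makes no claim about the PB API. BoundedInputStream and BoundedReadDemo are hypothetical names, and readToEnd is only a stand-in for any deserializer that reads until end of stream. The wrapper reports end of stream after a fixed number of bytes, so a reader that consumes "everything" sees only the key and leaves the value in the shared buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical wrapper: exposes at most `limit` bytes, then reports end of stream. */
final class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;            // pretend the stream has ended
        int b = in.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (remaining <= 0) return -1;
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(in.available(), remaining);
    }

    // mark/reset would let a caller rewind past the accounting above, so disable them.
    @Override
    public boolean markSupported() { return false; }

    @Override
    public void mark(int readLimit) { }

    @Override
    public void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }
}

final class BoundedReadDemo {
    /** Stand-in for a deserializer that keeps reading until end of stream. */
    static byte[] readToEnd(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4];
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One shared buffer holding a 3-byte key followed by a 2-byte value.
        InputStream shared = new ByteArrayInputStream(new byte[] {1, 2, 3, 10, 20});
        byte[] key = readToEnd(new BoundedInputStream(shared, 3));
        byte[] value = readToEnd(new BoundedInputStream(shared, 2));
        System.out.println(key.length + " key bytes, " + value.length + " value bytes");
    }
}
```

This only helps if the key and value lengths are known before deserialization starts, which is an assumption here; the separate-buffers approach avoids needing such a wrapper at all.
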
Does it work if you don't try to read from a sequence file? Is your use-case
based on being able to read from sequence files?
A couple of other points:
* PBDeserializerTracker reads from the stream twice, which isn't going to work.
You need to tee the stream to do debugging.
* Can we keep a Builder instance per deserializer rather than create a new one
for each call to deserialize? We should call clear on it each time to reset its
state.
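
On the first point, teeing can be sketched with the JDK alone. TeeInputStream and TeeDemo below are hypothetical names, not classes from the patch, and consume is only a stand-in for the real deserializer. Every byte the consumer reads is also copied to a side OutputStream, so the underlying stream is read once and the copy can be dumped afterwards for debugging.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical tee: copies every byte that is read to a second stream. */
final class TeeInputStream extends FilterInputStream {
    private final OutputStream copy;

    TeeInputStream(InputStream in, OutputStream copy) {
        super(in);
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) copy.write(b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) copy.write(buf, off, n);
        return n;
    }

    // Note: bytes passed over with skip() are not copied in this sketch.

    @Override
    public boolean markSupported() { return false; }
}

final class TeeDemo {
    /** Formats bytes as lowercase hex for a debug log. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Stand-in for the deserializer: reads to end of stream, returns the byte count. */
    static int consume(InputStream in) {
        try {
            int total = 0;
            byte[] chunk = new byte[2];
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream seen = new ByteArrayOutputStream();
        InputStream source = new ByteArrayInputStream(new byte[] {8, (byte) 0x96, 1});
        int consumed = consume(new TeeInputStream(source, seen));
        System.out.println(consumed + " bytes consumed: " + hex(seen.toByteArray()));
    }
}
```
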
> Add serialization for Protocol Buffers
> --------------------------------------
>
> Key: HADOOP-3788
> URL: https://issues.apache.org/jira/browse/HADOOP-3788
> Project: Hadoop Core
> Issue Type: Wish
> Components: examples, mapred
> Affects Versions: 0.19.0
> Reporter: Tom White
> Assignee: Alex Loddengaard
> Fix For: 0.19.0
>
> Attachments: hadoop-3788-v1.patch, protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding
> data in a compact binary format. This issue is to write a
> ProtocolBuffersSerialization to support using Protocol Buffers types in
> MapReduce programs, including an example program. This should probably go
> into contrib.