[ 
https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630130#action_12630130
 ] 

Tom White commented on HADOOP-3788:
-----------------------------------

bq. PBs do not provide a mechanism to limit the amount of data read from a 
stream, so your solution of breaking key, value pairs into two streams is the 
approach we should take.

Of course the other option is to propose changes to PB (which is open source) 
to limit the amount of data read. I think a change to CodedInputStream would be 
relatively simple.
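
As a rough illustration of what "limiting the amount of data read" could mean, 
here is a minimal sketch that uses only the JDK. BoundedInputStream is a 
hypothetical helper, not part of PB or Hadoop, and it is not the proposed 
CodedInputStream change itself. It only shows the idea of handing a 
deserializer a view of the stream that ends after a known number of bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (an assumption for illustration): exposes at most
// `limit` bytes of the wrapped stream, then reports end-of-stream.
class BoundedInputStream extends FilterInputStream {
  private long remaining;

  BoundedInputStream(InputStream in, long limit) {
    super(in);
    this.remaining = limit;
  }

  @Override
  public int read() throws IOException {
    if (remaining <= 0) {
      return -1; // limit reached: pretend the stream has ended
    }
    int b = in.read();
    if (b >= 0) {
      remaining--;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    if (remaining <= 0) {
      return -1;
    }
    // Never ask the wrapped stream for more than the remaining budget.
    int n = in.read(buf, off, (int) Math.min(len, remaining));
    if (n > 0) {
      remaining -= n;
    }
    return n;
  }

  @Override
  public long skip(long n) throws IOException {
    long skipped = in.skip(Math.min(n, remaining));
    remaining -= skipped;
    return skipped;
  }

  @Override
  public int available() throws IOException {
    return (int) Math.min(in.available(), remaining);
  }

  @Override
  public boolean markSupported() {
    return false; // mark/reset would bypass the byte budget
  }
}
```

A deserializer that reads until end-of-stream could be given 
new BoundedInputStream(in, recordLength) and would stop at the record 
boundary, leaving the bytes after it in the underlying stream. This assumes 
the record length is known up front.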

As a quick experiment I modified a working MapReduce program so that the 
deserializer read to the end of the stream. It failed in ReduceValuesIterator, 
so making this work would require changing more than just SequenceFile. 
Perhaps this reveals a bug in the MR system, one that has been masked because 
existing serializers consume only as much as they need (so being given more 
than they need is not a problem). Either way, I worry about defining the 
contract for deserializers so that the end of the stream marks the end of the 
object being read, as it might limit optimizations we may want to make in the 
future. What do others think?
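
To make the two possible contracts concrete, here is a small sketch that uses 
only the JDK. DeserializerContractDemo and its length-prefixed record format 
are assumptions made up for illustration. They are not Hadoop's Serialization 
API or PB's wire format. readExact consumes only what it needs, while 
readToEnd treats the end of the stream as the end of the object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (an assumption for illustration).
class DeserializerContractDemo {

  // Writes each record as a 4-byte length followed by its UTF-8 bytes.
  static byte[] writeRecords(String... records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String record : records) {
      byte[] payload = record.getBytes(StandardCharsets.UTF_8);
      out.writeInt(payload.length);
      out.write(payload);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Contract 1: consume exactly one record and leave the rest untouched.
  static String readExact(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    byte[] payload = new byte[data.readInt()];
    data.readFully(payload);
    return new String(payload, StandardCharsets.UTF_8);
  }

  // Contract 2: end-of-stream marks end-of-object, so drain everything.
  static byte[] readToEnd(InputStream in) throws IOException {
    ByteArrayOutputStream all = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      all.write(buf, 0, n);
    }
    return all.toByteArray();
  }
}
```

If a key and a value are written back to back into one stream, two calls to 
readExact return them in order. A single call to readToEnd swallows both 
records, so a caller that shares one stream between objects would have to 
hand each deserializer its own bounded stream.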

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3788-v1.patch, protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding 
> data in a compact binary format. This issue is to write a 
> ProtocolBuffersSerialization to support using Protocol Buffers types in 
> MapReduce programs, including an example program. This should probably go 
> into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
