[ 
https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Loddengaard updated HADOOP-3788:
-------------------------------------

    Attachment: hadoop-3788-v2.patch

Attaching a new patch.  Changes:

 * Removed _*Tracker_ and _TestPBHadoopStreams_ because they weren't very 
useful now that we've established streams have trailing data
 * Did not keep a single Builder instance in _PBDeserializer_, because Builders 
need to be rebuilt once _build()_ has been called.  From the PB API: "[build()] 
Construct the final message. Once [build()] is called, the Builder is no longer 
valid, and calling any other method may throw a NullPointerException. If you 
need to continue working with the builder after calling build(), clone() it 
first."  I made the decision to just re-instantiate instead of clone, because I 
thought the performance differences were negligible.  Please argue with me if 
I'm wrong.
 * Changed _SequenceFile.Reader#next(Object)_
 * Changed _TestPBSerialization_ to simply write a SequenceFile and then read 
it back.
 * Created a new test, _TestPBSerializationMapReduce_, that uses PBs in a 
MapReduce program
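
To make the Builder point above concrete, here is a minimal sketch of the re-instantiate-per-call approach. It does not use the real PB classes: _OneShotBuilder_ and _FreshBuilderDeserializer_ are hypothetical stand-ins, and the stand-in throws an IllegalStateException on reuse where the PB docs mention a NullPointerException. It also assumes Java 8 for _Supplier_.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a PB Builder: it becomes invalid once build() is called.
final class OneShotBuilder {
    private StringBuilder content = new StringBuilder();

    OneShotBuilder append(String s) {
        if (content == null) {
            throw new IllegalStateException("builder used after build()");
        }
        content.append(s);
        return this;
    }

    String build() {
        if (content == null) {
            throw new IllegalStateException("build() called twice");
        }
        String result = content.toString();
        content = null; // invalidate, mimicking the documented PB behaviour
        return result;
    }
}

// Hypothetical deserializer that asks a factory for a fresh builder on every call
// instead of caching a single instance in a field.
final class FreshBuilderDeserializer {
    private final Supplier<OneShotBuilder> builderFactory;

    FreshBuilderDeserializer(Supplier<OneShotBuilder> builderFactory) {
        this.builderFactory = builderFactory;
    }

    String deserialize(String raw) {
        // A new builder per record, so no call ever sees an invalidated builder.
        return builderFactory.get().append(raw).build();
    }
}
```

With this shape, `new FreshBuilderDeserializer(OneShotBuilder::new)` can be called repeatedly, whereas a cached builder would fail on the second record. Whether cloning would be measurably cheaper than re-instantiating is not something this sketch addresses.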

_TestPBSerialization_ passes, but _TestPBSerializationMapReduce_ does not, 
which means you're right, Tom, that other code will need to change, though I'm 
not familiar enough with Hadoop to say more than that.  If we decide to move 
further along by changing Hadoop such that deserializers will never be given 
trailing data, then more guidance would be greatly appreciated :).

This patch breaks a few existing tests, such as 
_org.apache.hadoop.fs.TestCopyFiles_ and _org.apache.hadoop.fs.TestFileSystem_. 
It's unclear whether my change causes these failures or whether they come from 
other areas that I left unchanged.  Regardless, I think this shows that 
establishing a contract of no extra data in the _Deserializer_'s _InputStream_ 
would probably be a large change.

There is a discussion going on in the PB Google Group about possibly making PBs 
self-delimiting.  Take a look 
[here|http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en].
In summary, a few different people are trying to determine the best way to 
support self-delimiting messages, though there hasn't been any talk of a 
schedule.
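
Until something like that lands in PB itself, one way a serializer could cope with trailing data is to frame each record with an explicit length. The sketch below is only an illustration of that idea, not what the attached patch does and not a format proposed in the linked thread. _LengthPrefixedFraming_ is a hypothetical helper, and the 4-byte big-endian length prefix is an assumption.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of Hadoop or the PB library.
final class LengthPrefixedFraming {
    private LengthPrefixedFraming() {}

    /** Writes a 4-byte big-endian length followed by the payload bytes. */
    static void writeRecord(OutputStream out, byte[] payload) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(payload.length);
        data.write(payload);
        data.flush();
    }

    /** Reads exactly one record and leaves any trailing bytes unread. */
    static byte[] readRecord(InputStream in) throws IOException {
        // DataInputStream does not buffer, so it consumes only what is asked for.
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0) {
            throw new IOException("negative record length: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload);
        return payload;
    }
}
```

In a PB setting the payload would presumably come from the message's _toByteArray()_ and be handed to the generated class's _parseFrom(byte[])_, so the parser never sees bytes beyond its own record. The cost is an extra copy of each record into a byte array.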

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3788-v1.patch, hadoop-3788-v2.patch, 
> protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding 
> data in a compact binary format. This issue is to write a 
> ProtocolBuffersSerialization to support using Protocol Buffers types in 
> MapReduce programs, including an example program. This should probably go 
> into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.