Arrays.copyOf isn't required; protocol buffers have a method to merge from a range of bytes. You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names in earlier versions of Hadoop are slightly different (get() and getSize() instead of getBytes() and getLength()).
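As a minimal sketch of this, assuming "MyMessage" stands in for your generated message class: the builder's mergeFrom(byte[], int, int) reads only the valid region of the backing array, so the stale bytes past getLength() are never touched.

    import com.google.protobuf.InvalidProtocolBufferException;
    import org.apache.hadoop.io.BytesWritable;

    public class ProtoFromWritable {
        // Parse a protobuf message out of a reused BytesWritable without
        // copying: only bytes [0, getLength()) of the backing array are read.
        public static MyMessage parse(BytesWritable value)
                throws InvalidProtocolBufferException {
            return MyMessage.newBuilder()
                    .mergeFrom(value.getBytes(), 0, value.getLength())
                    .build();
        }
    }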

--
gaurav



On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote:

Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The reason is that Writables in general are single instances shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each record is read into that same instance. The purpose is to avoid the allocation cost for each row.
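A sketch of that reuse pattern, assuming a plain SequenceFile.Reader loop (method names per Hadoop 0.19):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReuseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(args[0]), conf);
            // One key and one value instance are reused for every record;
            // the value's backing array grows to fit the largest record
            // seen so far, but is never shrunk.
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                System.out.println("valid=" + value.getLength()
                        + " backing=" + value.getBytes().length);
            }
            reader.close();
        }
    }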

The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents; they are not guaranteed to be zero.

Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API; perhaps you can use a ByteArrayInputStream here to your advantage.
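Something along these lines might work (an untested sketch; "MyMessage" is a placeholder for the generated class, and the ByteArrayInputStream(byte[], int, int) constructor wraps just the valid region without copying; get()/getSize() per Hadoop 0.17/0.18):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;

    public class StreamParse {
        // Wrap only the valid portion of the reused buffer in a stream,
        // then let the protobuf runtime parse from it.
        public static MyMessage parse(BytesWritable value) throws IOException {
            ByteArrayInputStream in = new ByteArrayInputStream(
                    value.get(), 0, value.getSize());
            return MyMessage.parseFrom(in);
        }
    }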

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:


I tried to store a protocol buffer as a BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values back, value.get() returns more bytes than what was stored; however, value.getSize() returns the correct number. This means that in order to convert the byte[] back into a protocol buffer, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both versions 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file are below (see the sketch after the numbers); the extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
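
For reference, a sketch of the round trip described above, using Hadoop 0.17/0.18 method names; "MyMessage" is a placeholder for the generated class, and convertToBytes() is assumed to wrap protobuf's standard toByteArray().

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ProtoSeqFile {
        // Write path: one BytesWritable per serialized message.
        static void write(SequenceFile.Writer writer, String key, MyMessage msg)
                throws IOException {
            writer.append(new Text(key), new BytesWritable(msg.toByteArray()));
        }

        // Read path: trim the reused buffer to its valid length before parsing.
        static MyMessage read(BytesWritable value) throws IOException {
            byte[] exact = Arrays.copyOf(value.get(), value.getSize());
            return MyMessage.parseFrom(exact);
        }
    }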



