Hi Bing,

The issue here is that BytesWritable uses an internal buffer which grows
but never shrinks. This is because Writables in general are single
instances that are reused across multiple input records. If you look at the
internals of the input reader, you'll see that a single BytesWritable is
instantiated, and then each time a record is read, it's read into that same
instance. The purpose here is to avoid the allocation cost for each row.
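
To make the reuse pattern concrete, here is a minimal sketch in plain Java.
ReusableBuffer is a hypothetical stand-in, not Hadoop's actual BytesWritable
code, and the 3/2 growth factor is an assumption inferred from the sizes in
your message (7066 * 3 / 2 = 10599, 36456 * 3 / 2 = 54684, and so on).

```java
class ReusedBufferDemo {

    /** Hypothetical stand-in for a reused Writable: the buffer grows, never shrinks. */
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int size = 0;

        /** Loads one record into the shared buffer, growing it only if needed. */
        void set(byte[] src) {
            if (src.length > bytes.length) {
                // Assumed growth policy: 1.5x the requested size.
                bytes = new byte[src.length * 3 / 2];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            size = src.length;
        }

        byte[] get() { return bytes; }   // whole backing array
        int getSize() { return size; }   // number of valid bytes
    }

    /** Feeds records of the given sizes through ONE buffer, recording its capacity each time. */
    static int[] capacitiesAfter(int... recordSizes) {
        ReusableBuffer buf = new ReusableBuffer();
        int[] capacities = new int[recordSizes.length];
        for (int i = 0; i < recordSizes.length; i++) {
            buf.set(new byte[recordSizes[i]]);
            capacities[i] = buf.get().length;
        }
        return capacities;
    }

    public static void main(String[] args) {
        int[] sizes = {7066, 36456, 32275, 40561};
        int[] capacities = capacitiesAfter(sizes);
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("getSize(): " + sizes[i]
                    + "  get().length: " + capacities[i]);
        }
    }
}
```

Under that assumption, the sketch reproduces the first four rows of your
sample output: the capacity only jumps when a record is larger than any seen
so far.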

The end result is, as you've seen, that get() returns an array which
may be larger than the actual amount of data. In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents; they are not
guaranteed to be zero and may hold stale data from an earlier record.

Unfortunately, if the protobuf API doesn't allow you to deserialize from a
sub-range of a byte array, you're out of luck and will have to do the copy
like you've mentioned. I imagine, though, that there's some way around this
in the protobuf API - perhaps you can wrap only the valid bytes in a
ByteArrayInputStream and parse from that.
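
As a rough sketch of that idea: ByteArrayInputStream has a constructor taking
(buf, offset, length), so the stream can be limited to the valid bytes without
copying the array. The code below uses only the JDK. readAll is a hypothetical
stand-in for whatever parser accepts an InputStream. I believe the generated
protobuf classes offer parseFrom(InputStream) and the builders offer
mergeFrom(byte[], int, int), but please check that against the version you use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ValidRangeDemo {

    /** Hypothetical stand-in for a parser: consumes the whole stream as ASCII text. */
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Exposes only the first `size` bytes of the backing array; no copy of the array is made. */
    static String decodeValid(byte[] backing, int size) {
        return readAll(new ByteArrayInputStream(backing, 0, size));
    }

    public static void main(String[] args) {
        // Simulate a reused buffer: a long record first, then a shorter one on top.
        byte[] backing = new byte[16];
        byte[] first = "first-record".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(first, 0, backing, 0, first.length);
        byte[] second = "second".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(second, 0, backing, 0, second.length);
        int size = second.length; // what getSize() would report

        // Only the valid range is read.
        System.out.println(decodeValid(backing, size)); // prints "second"

        // Reading the whole array also picks up leftovers from the first record.
        System.out.println(readAll(new ByteArrayInputStream(backing)).length());
    }
}
```

In your case the backing array would be value.get() and the size would be
value.getSize().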

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:

>
> I tried to store a protocol buffer as a BytesWritable in a sequence file
> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key),
> new BytesWritable(protobuf.convertToBytes())).  When reading the values from
> key/value pairs using value.get(), it returns more than what's stored.
> However, value.getSize() returns the correct number.  This means that in
> order to convert the byte[] to a protocol buffer again, I have to do
> Arrays.copyOf(value.get(), value.getSize()).  This happens on both versions
> 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
> few entries in the sequence file are below.  The extra bytes in value.get()
> all have values of zero.
>
> value.getSize(): 7066   value.get().length: 10599
> value.getSize(): 36456  value.get().length: 54684
> value.getSize(): 32275  value.get().length: 54684
> value.getSize(): 40561  value.get().length: 54684
> value.getSize(): 16855  value.get().length: 54684
> value.getSize(): 66304  value.get().length: 99456
> value.getSize(): 26488  value.get().length: 99456
> value.getSize(): 59327  value.get().length: 99456
> value.getSize(): 36865  value.get().length: 99456
>
> --
> View this message in context:
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
