Thanks for the clarification.  I still find it strange, though, that the
get() method doesn't return just what's actually stored, regardless of
buffer size.  Is there any reason why you'd want to use or examine the rest
of the buffer?


Todd Lipcon-4 wrote:
> 
> Hi Bing,
> 
> The issue here is that BytesWritable uses an internal buffer which is
> grown but never shrunk. The reason is that Writables in general are single
> instances that are shared across multiple input records. If you look at
> the internals of the input reader, you'll see that a single BytesWritable
> is instantiated, and then each time a record is read, it's read into that
> same instance. The purpose here is to avoid the allocation cost for each
> row.
> 
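To make the reuse pattern described above concrete, here is a minimal sketch in plain Java. ReusableBuffer is a hypothetical stand-in written for this thread, not Hadoop's BytesWritable, and the 50% over-allocation is an assumption. That factor is at least consistent with the sizes reported in the original post (7066 * 3 / 2 = 10599).

```java
import java.util.Arrays;

// Hypothetical stand-in for a reused, growable record holder. This is NOT
// Hadoop's BytesWritable; it only illustrates the grow-but-never-shrink idea.
final class ReusableBuffer {
    private byte[] bytes = new byte[0]; // backing array, grown but never shrunk
    private int size = 0;               // number of valid bytes in the array

    // Load a new record into the same instance, growing the array only
    // when the record does not fit.
    void set(byte[] data) {
        if (data.length > bytes.length) {
            // Assumed policy: over-allocate by 50% to limit reallocations.
            bytes = new byte[data.length * 3 / 2];
        }
        System.arraycopy(data, 0, bytes, 0, data.length);
        size = data.length;
    }

    // Whole backing array; its length may exceed getSize().
    byte[] get() { return bytes; }

    // Number of valid bytes.
    int getSize() { return size; }

    // Copy out only the valid bytes, as Arrays.copyOf does in the original post.
    byte[] copyValid() { return Arrays.copyOf(bytes, size); }
}
```

After a large record is followed by a smaller one, get() still hands back the large array, and the bytes past getSize() are whatever the earlier record left behind.
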
> The end result is, as you've seen, that get() returns an array which
> may be larger than the actual amount of data. In fact, the extra bytes
> (between .getSize() and .get().length) have undefined contents, not
> necessarily zero.
> 
> Unfortunately, if the protocol buffer API doesn't allow you to deserialize
> out of a smaller portion of a byte array, you're out of luck and will have
> to do the copy like you've mentioned. I imagine, though, that there's some
> way around this in the protocol buffer API - perhaps you can use a
> ByteArrayInputStream here to your advantage.
> 
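A rough sketch of the ByteArrayInputStream idea, using only the JDK. ValidPrefix and its method names are hypothetical, and the backing array and size stand in for value.get() and value.getSize(). The three-argument ByteArrayInputStream constructor exposes just the first `size` bytes without copying. Whether a given protocol buffer class can parse from such a stream is an assumption to check against its API.

```java
import java.io.ByteArrayInputStream;

// Hypothetical helper: expose only the valid prefix of an oversized array.
final class ValidPrefix {
    // Wrap the first `size` bytes of `backing`; no copy is made.
    static ByteArrayInputStream open(byte[] backing, int size) {
        return new ByteArrayInputStream(backing, 0, size);
    }

    // Illustration only: count how many bytes the stream actually yields.
    static int drain(ByteArrayInputStream in) {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }
}
```

A consumer reading from open(value.get(), value.getSize()) would see end-of-stream after the stored bytes, so the padding never reaches the deserializer.
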
> Hope that helps
> -Todd
> 
> On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:
> 
>>
>> I tried to store a protocol buffer as a BytesWritable in a sequence file
>> <Text, BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key),
>> new BytesWritable(protobuf.convertToBytes())).  When reading the values
>> from key/value pairs using value.get(), it returns more than what's stored.
>> However, value.getSize() returns the correct number.  This means that in
>> order to convert the byte[] back to a protocol buffer, I have to do
>> Arrays.copyOf(value.get(), value.getSize()).  This happens on both
>> versions 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample
>> sizes for a few entries in the sequence file are below.  The extra bytes
>> in value.get() all have values of zero.
>>
>> value.getSize(): 7066   value.get().length: 10599
>> value.getSize(): 36456  value.get().length: 54684
>> value.getSize(): 32275  value.get().length: 54684
>> value.getSize(): 40561  value.get().length: 54684
>> value.getSize(): 16855  value.get().length: 54684
>> value.getSize(): 66304  value.get().length: 99456
>> value.getSize(): 26488  value.get().length: 99456
>> value.getSize(): 59327  value.get().length: 99456
>> value.getSize(): 36865  value.get().length: 99456
>>
>> --
>> View this message in context:
>> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
