Arrays.copyOf isn't required; protocol buffers have a method to merge from a range of bytes. You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names in earlier versions of Hadoop are slightly different (get() and getSize() instead of getBytes() and getLength()).
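As a minimal sketch of this, assuming "MyMessage" stands in for your generated message class: the builder's mergeFrom(byte[], int, int) reads only the valid region of the backing array, so the stale bytes past getLength() are never touched.

    import com.google.protobuf.InvalidProtocolBufferException;
    import org.apache.hadoop.io.BytesWritable;

    public class ProtoFromWritable {
        // Parse a protobuf message out of a reused BytesWritable without
        // copying: only bytes [0, getLength()) of the backing array are read.
        public static MyMessage parse(BytesWritable value)
                throws InvalidProtocolBufferException {
            return MyMessage.newBuilder()
                    .mergeFrom(value.getBytes(), 0, value.getLength())
                    .build();
        }
    }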

--
gaurav



On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote:

Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The reason is that Writables in general are single instances shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each record is read into that same instance. The purpose is to avoid the allocation cost for each row.
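A sketch of that reuse pattern, assuming a plain SequenceFile.Reader loop (method names per Hadoop 0.19):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReuseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(args[0]), conf);
            // One key and one value instance are reused for every record;
            // the value's backing array grows to fit the largest record
            // seen so far, but is never shrunk.
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                System.out.println("valid=" + value.getLength()
                        + " backing=" + value.getBytes().length);
            }
            reader.close();
        }
    }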

The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents; they are not guaranteed to be zero.

Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API; perhaps you can use a ByteArrayInputStream here to your advantage.
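Something along these lines might work (an untested sketch; "MyMessage" is a placeholder for the generated class, and the ByteArrayInputStream(byte[], int, int) constructor wraps just the valid region without copying; get()/getSize() per Hadoop 0.17/0.18):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;

    public class StreamParse {
        // Wrap only the valid portion of the reused buffer in a stream,
        // then let the protobuf runtime parse from it.
        public static MyMessage parse(BytesWritable value) throws IOException {
            ByteArrayInputStream in = new ByteArrayInputStream(
                    value.get(), 0, value.getSize());
            return MyMessage.parseFrom(in);
        }
    }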

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:


I tried to store a protocol buffer as a BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values back, value.get() returns more bytes than what was stored; however, value.getSize() returns the correct number. This means that in order to convert the byte[] back into a protocol buffer, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both versions 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file are below (see the sketch after the numbers); the extra bytes in value.get() all have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456
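
For reference, a sketch of the round trip described above, using Hadoop 0.17/0.18 method names; "MyMessage" is a placeholder for the generated class, and convertToBytes() is assumed to wrap protobuf's standard toByteArray().

    import java.io.IOException;
    import java.util.Arrays;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ProtoSeqFile {
        // Write path: one BytesWritable per serialized message.
        static void write(SequenceFile.Writer writer, String key, MyMessage msg)
                throws IOException {
            writer.append(new Text(key), new BytesWritable(msg.toByteArray()));
        }

        // Read path: trim the reused buffer to its valid length before parsing.
        static MyMessage read(BytesWritable value) throws IOException {
            byte[] exact = Arrays.copyOf(value.get(), value.getSize());
            return MyMessage.parseFrom(exact);
        }
    }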



