Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances that are shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it's read into that same instance. The purpose here is to avoid the allocation cost for each row.
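To make the reuse pattern concrete, here is a minimal sketch. GrowOnlyBuffer is a hypothetical stand-in, not Hadoop's actual BytesWritable, and the 3/2 growth factor is an assumption inferred from the sizes in the quoted message (7066 -> 10599, 36456 -> 54684).

```java
import java.util.Arrays;

public class GrowOnlyBufferDemo {
    /** Hypothetical stand-in for BytesWritable: one instance reused for every record. */
    static final class GrowOnlyBuffer {
        private byte[] bytes = new byte[0];
        private int size = 0;

        /** Copies a record into the shared buffer, growing it only when needed. */
        void set(byte[] record) {
            if (record.length > bytes.length) {
                // Assumed growth policy: allocate 1.5x the requested size.
                bytes = new byte[record.length * 3 / 2];
            }
            System.arraycopy(record, 0, bytes, 0, record.length);
            size = record.length; // the buffer itself is never shrunk
        }

        byte[] get() { return bytes; }
        int getSize() { return size; }
    }

    public static void main(String[] args) {
        GrowOnlyBuffer value = new GrowOnlyBuffer();
        int[] recordSizes = {36456, 32275, 40561};
        for (int i = 0; i < recordSizes.length; i++) {
            byte[] record = new byte[recordSizes[i]];
            Arrays.fill(record, (byte) (i + 1)); // distinct fill value per record
            value.set(record);
            System.out.println("getSize(): " + value.getSize()
                    + "  get().length: " + value.get().length);
        }
    }
}
```

With these three record sizes the backing array stays at 54684 bytes throughout, and after the second record the byte at index 32275 still holds the first record's fill value, which is why the tail should be treated as garbage rather than as zero padding.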
The end result is, as you've seen, that get() (getBytes() in newer releases) returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between getSize() and get().length) have undefined contents, not necessarily zero. Unfortunately, if the protocol buffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protocol buffer API - perhaps you can use a ByteArrayInputStream here to your advantage.

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:
>
> I tried to store protocolbuffer as BytesWritable in a sequence file
> <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key),
> new BytesWritable(protobuf.convertToBytes())). When reading the values from
> key/value pairs using value.get(), it returns more than what's stored.
> However, value.getSize() returns the correct number. This means in order to
> convert the byte[] to protocol buffer again, I have to do
> Arrays.copyOf(value.get(), value.getSize()). This happens on both version
> 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a
> few entries in the sequence file below. The extra bytes in value.get() all
> have values of zero.
>
> value.getSize(): 7066    value.get().length: 10599
> value.getSize(): 36456   value.get().length: 54684
> value.getSize(): 32275   value.get().length: 54684
> value.getSize(): 40561   value.get().length: 54684
> value.getSize(): 16855   value.get().length: 54684
> value.getSize(): 66304   value.get().length: 99456
> value.getSize(): 26488   value.get().length: 99456
> value.getSize(): 59327   value.get().length: 99456
> value.getSize(): 36865   value.get().length: 99456
>
> --
> View this message in context:
> http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

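As a rough illustration of the ByteArrayInputStream suggestion in the reply, the sketch below wraps only the valid prefix of an oversized array, so nothing is copied up front. The names validPrefix and drain are hypothetical helpers; drain merely stands in for whatever deserializer accepts an InputStream. With Hadoop you would presumably pass value.get() and value.getSize(), and whether your protocol buffer version can parse from an InputStream is something to check against its documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ValidPrefixDemo {
    /** Exposes only the first validLength bytes of backing, without copying. */
    static InputStream validPrefix(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    /** Reads a stream to its end; a stand-in for a real deserializer. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes(StandardCharsets.US_ASCII);

        // Simulate an oversized, reused buffer with stale bytes in the tail.
        byte[] backing = new byte[payload.length * 3 / 2];
        System.arraycopy(payload, 0, backing, 0, payload.length);
        Arrays_fillTail(backing, payload.length);

        byte[] decoded = drain(validPrefix(backing, payload.length));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }

    /** Fills everything from index 'from' onward with a non-zero marker byte. */
    static void Arrays_fillTail(byte[] array, int from) {
        for (int i = from; i < array.length; i++) {
            array[i] = (byte) 'X';
        }
    }
}
```

The stream stops at the valid length, so the stale tail bytes never reach the consumer. If the deserializer needs a byte[] rather than a stream, the Arrays.copyOf approach from the original question remains the fallback.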