On Wed, Apr 8, 2009 at 7:14 PM, bzheng <[email protected]> wrote:
>
> Thanks for the clarification. Though I still find it strange that the
> get() method doesn't return only what's actually stored, regardless of
> buffer size. Is there any reason why you'd want to use/examine what's
> in the buffer?
>

Because doing so requires an array copy. It's important for Hadoop
performance to avoid copies of data when they're unnecessary. Most APIs
that take byte[] arrays have a version that includes an offset and length.
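For example, something along these lines should let you parse straight
out of the valid region without copying (untested sketch; MyProto stands
in for your generated protocol buffer class):

  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import org.apache.hadoop.io.BytesWritable;

  // Only the first getSize() bytes of get() are valid payload; the
  // (offset, length) constructor restricts the stream to that region.
  public static MyProto fromWritable(BytesWritable value) throws IOException {
    ByteArrayInputStream in =
        new ByteArrayInputStream(value.get(), 0, value.getSize());
    return MyProto.parseFrom(in);
  }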
-Todd

> Todd Lipcon-4 wrote:
> >
> > Hi Bing,
> >
> > The issue here is that BytesWritable uses an internal buffer which is
> > grown but not shrunk. The cause of this is that Writables in general
> > are single instances that are shared across multiple input records.
> > If you look at the internals of the input reader, you'll see that a
> > single BytesWritable is instantiated, and then each time a record is
> > read, it's read into that same instance. The purpose here is to avoid
> > the allocation cost for each row.
> >
> > The end result is, as you've seen, that getBytes() returns an array
> > which may be larger than the actual amount of data. In fact, the
> > extra bytes (between .getSize() and .get().length) have undefined
> > contents; they are not necessarily zero.
> >
> > Unfortunately, if the protobuffer API doesn't allow you to
> > deserialize out of a smaller portion of a byte array, you're out of
> > luck and will have to do the copy like you've mentioned. I imagine,
> > though, that there's some way around this in the protobuffer API -
> > perhaps you can use a ByteArrayInputStream here to your advantage.
> >
> > Hope that helps
> > -Todd
> >
> > On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:
> >
> > > I tried to store a protocol buffer as BytesWritable in a sequence
> > > file <Text, BytesWritable>. It's stored using
> > > SequenceFile.Writer(new Text(key), new
> > > BytesWritable(protobuf.convertToBytes())). When reading the values
> > > from key/value pairs using value.get(), it returns more than what's
> > > stored. However, value.getSize() returns the correct number. This
> > > means in order to convert the byte[] back to a protocol buffer, I
> > > have to do Arrays.copyOf(value.get(), value.getSize()). This
> > > happens on both version 0.17.2 and 0.18.3. Does anyone know why
> > > this happens? Sample sizes for a few entries in the sequence file
> > > below. The extra bytes in value.get() all have values of zero.
> > >
> > > value.getSize(): 7066   value.get().length: 10599
> > > value.getSize(): 36456  value.get().length: 54684
> > > value.getSize(): 32275  value.get().length: 54684
> > > value.getSize(): 40561  value.get().length: 54684
> > > value.getSize(): 16855  value.get().length: 54684
> > > value.getSize(): 66304  value.get().length: 99456
> > > value.getSize(): 26488  value.get().length: 99456
> > > value.getSize(): 59327  value.get().length: 99456
> > > value.getSize(): 36865  value.get().length: 99456
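P.S. If you do end up having to copy, the read side described in the
original message boils down to something like this (untested sketch; the
file path argument and MyProto are placeholders):

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public static void readProtos(Path path) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();  // reused for every record
      while (reader.next(key, value)) {
        // get() may be longer than the record; getSize() is authoritative,
        // so trim before handing the bytes to the protobuf parser.
        byte[] exact = Arrays.copyOf(value.get(), value.getSize());
        MyProto proto = MyProto.parseFrom(exact);
        // ... use proto ...
      }
    } finally {
      reader.close();
    }
  }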
