Arrays.copyOf isn't required; protocol buffers have a method to merge from a byte range. You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names in earlier versions of Hadoop (e.g. get() and getSize()) might be slightly different.
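The point of the (byte[], offset, length) overload is that you never have to trim the buffer at all. As a stdlib-only illustration of that same call pattern (the generated protobuf class is omitted here; new String stands in for the parser), parsing only the first `length` bytes of an oversized buffer recovers exactly the stored payload:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OffsetParse {
    // Decode only the first `length` bytes of a possibly oversized buffer,
    // mirroring mergeFrom(bytes, 0, length): no Arrays.copyOf needed.
    static String decodeValid(byte[] buf, int length) {
        return new String(buf, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // Oversized backing array; the tail past payload.length is junk.
        byte[] padded = Arrays.copyOf(payload, 16);
        System.out.println(decodeValid(padded, payload.length));  // prints "hello"
    }
}
```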
--
gaurav
On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote:
Hi Bing,
The issue here is that BytesWritable uses an internal buffer which is grown but not shrunk. The cause of this is that Writables in general are single instances shared across multiple input records. If you look at the internals of the input reader, you'll see that a single BytesWritable is instantiated, and then each time a record is read, it is read into that same instance. The purpose here is to avoid the allocation cost for each row.
The end result is, as you've seen, that getBytes() returns an array which may be larger than the actual amount of data. In fact, the extra bytes (between .getSize() and .get().length) have undefined contents; they are not guaranteed to be zero.
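The grow-but-never-shrink behavior can be sketched with a tiny stand-in class. The 1.5x growth factor below is an assumption, but it is consistent with the sample sizes later in this thread, where get().length is exactly 1.5 times the largest getSize() seen so far (e.g. 36456 * 1.5 = 54684):

```java
import java.util.Arrays;

public class ReusedBuffer {
    private byte[] bytes = new byte[0];
    private int size;

    // Grow-only resize, roughly how a reused Writable buffer behaves:
    // capacity is expanded (with headroom) but never shrunk.
    // The 1.5x headroom factor here is an assumption for illustration.
    void setSize(int newSize) {
        if (newSize > bytes.length) {
            bytes = Arrays.copyOf(bytes, newSize + newSize / 2);
        }
        size = newSize;
    }

    byte[] get() { return bytes; }   // may be longer than size
    int getSize() { return size; }

    public static void main(String[] args) {
        ReusedBuffer w = new ReusedBuffer();
        w.setSize(36456);  // a large record grows the buffer to 54684
        w.setSize(16855);  // a smaller record reuses it; capacity stays 54684
        System.out.println(w.getSize() + " vs " + w.get().length);  // prints "16855 vs 54684"
    }
}
```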
Unfortunately, if the protobuffer API doesn't allow you to deserialize out of a smaller portion of a byte array, you're out of luck and will have to do the copy like you've mentioned. I imagine, though, that there's some way around this in the protobuffer API - perhaps you can use a ByteArrayInputStream here to your advantage.
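The ByteArrayInputStream idea works because its (buf, offset, length) constructor exposes only the valid region of the backing array; a parser reading from the stream can never see the stale trailing bytes. A minimal stdlib sketch (whether your protobuf version can parse from an InputStream is worth checking against its docs):

```java
import java.io.ByteArrayInputStream;

public class ValidRegionStream {
    // Wrap only the first `length` bytes of the backing buffer as a stream
    // and count how many bytes a consumer can actually read from it.
    static int countReadable(byte[] buf, int length) {
        ByteArrayInputStream in = new ByteArrayInputStream(buf, 0, length);
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[10599];                  // oversized backing array
        System.out.println(countReadable(buf, 7066));  // prints 7066
    }
}
```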
Hope that helps
-Todd
On Wed, Apr 8, 2009 at 4:59 PM, bzheng <[email protected]> wrote:
I tried to store a protocol buffer as BytesWritable in a sequence file <Text, BytesWritable>. It's stored using SequenceFile.Writer(new Text(key), new BytesWritable(protobuf.convertToBytes())). When reading the values from the key/value pairs using value.get(), it returns more than what's stored. However, value.getSize() returns the correct number. This means that in order to convert the byte[] back to a protocol buffer, I have to do Arrays.copyOf(value.get(), value.getSize()). This happens on both versions 0.17.2 and 0.18.3. Does anyone know why this happens? Sample sizes for a few entries in the sequence file are below. The extra bytes in value.get() all have values of zero.
value.getSize(): 7066 value.get().length: 10599
value.getSize(): 36456 value.get().length: 54684
value.getSize(): 32275 value.get().length: 54684
value.getSize(): 40561 value.get().length: 54684
value.getSize(): 16855 value.get().length: 54684
value.getSize(): 66304 value.get().length: 99456
value.getSize(): 26488 value.get().length: 99456
value.getSize(): 59327 value.get().length: 99456
value.getSize(): 36865 value.get().length: 99456
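The copy workaround described above can be sketched with plain arrays (the sizes are taken from the first sample entry; BytesWritable itself isn't needed to show the shape of the fix):

```java
import java.util.Arrays;

public class TrimWorkaround {
    // Copy only the valid prefix, discarding the reused buffer's stale tail.
    static byte[] trim(byte[] raw, int validLength) {
        return Arrays.copyOf(raw, validLength);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[10599];   // like value.get(): oversized backing array
        int size = 7066;                // like value.getSize(): actual record length
        byte[] exact = trim(raw, size);
        System.out.println(exact.length);  // prints 7066
    }
}
```

This costs one allocation and copy per record, which is exactly the overhead that an offset/length-aware parse avoids.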
--
View this message in context:
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.