If the underlying stream does not provide its own record boundaries, then you
need to prefix each protocol message with its size.  Hacking an end-of-record
"feature" into the protobuf code is probably not a good idea.  We already
provide parseDelimitedFrom()/writeDelimitedTo(), which prefix the message
with its size.  If you were to use the features we provide rather than invent
your own, you wouldn't have these problems.
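
For example, a minimal sketch of the delimited calls (using MyPBClass from
the code below as the generated message type; the file name is just an
illustration):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    // Writing: writeDelimitedTo() prefixes each message with its size as a
    // varint.
    FileOutputStream out = new FileOutputStream("records.bin");
    for (MyPBClass msg : messages) {
      msg.writeDelimitedTo(out);
    }
    out.close();

    // Reading: parseDelimitedFrom() consumes exactly one size-prefixed
    // message per call and returns null at clean EOF.
    FileInputStream in = new FileInputStream("records.bin");
    MyPBClass msg;
    while ((msg = MyPBClass.parseDelimitedFrom(in)) != null) {
      // process msg
    }
    in.close();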

But it is very surprising to me that the library you are using does not
provide its own delimitation.  What would happen if your records were just
blobs of text instead of protocol buffers?  Blobs of text do not delimit
themselves.  Even XML and JSON can have arbitrary amounts of whitespace at
the end.  Why do you expect protocol buffers to self-delimit when no other
format does this?

On Fri, Feb 19, 2010 at 11:01 AM, Yang <teddyyyy...@gmail.com> wrote:

> For your last comment: yes, the end-of-record indicator was another hack I
> put in.
>
> But both of your options above ultimately require the underlying stream to
> provide exact record boundaries.
> In the last email I pointed out that this may or may not be a valid
> requirement for the underlying InputStream, since only by looking at the
> byte stream itself can we figure out where a record ends.  But when we
> write the stream, the underlying stream could insert some markers of its
> own and make use of them later.
>
>
> On Fri, Feb 19, 2010 at 2:16 AM, Kenton Varda <ken...@google.com> wrote:
>
>> Two options:
>>
>> 1) Do not use parseFrom(InputStream).  Use parseFrom(byte[]).  Read the
>> byte array from the stream yourself, so you can make sure to read only the
>> correct number of bytes.  (See the sketch below.)
>>
>> 2) Create a FilterInputStream subclass which limits reading to some
>> number of bytes.  Wrap your InputStream in that, then pass the wrapper to
>> parseFrom(InputStream).  Then it cannot accidentally read too much.
>> (Again, see the sketch below.)
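>>
>> A minimal sketch of both options (recordSize is a placeholder for however
>> you determine the record length; this is illustrative, not tested code):
>>
>> import java.io.DataInputStream;
>> import java.io.FilterInputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>>
>> // Option 1: read exactly recordSize bytes yourself, then parse the array.
>> byte[] buf = new byte[recordSize];
>> new DataInputStream(in).readFully(buf);
>> MyPBClass msg = MyPBClass.parseFrom(buf);
>>
>> // Option 2: a wrapper that reports EOF after `limit` bytes, so the parser
>> // cannot read past the record boundary.
>> class LimitedInputStream extends FilterInputStream {
>>   private int remaining;
>>
>>   LimitedInputStream(InputStream in, int limit) {
>>     super(in);
>>     this.remaining = limit;
>>   }
>>
>>   @Override public int read() throws IOException {
>>     if (remaining <= 0) return -1;
>>     int b = in.read();
>>     if (b >= 0) remaining--;
>>     return b;
>>   }
>>
>>   @Override public int read(byte[] b, int off, int len) throws IOException {
>>     if (remaining <= 0) return -1;
>>     int n = in.read(b, off, Math.min(len, remaining));
>>     if (n > 0) remaining -= n;
>>     return n;
>>   }
>> }
>>
>> // Usage: MyPBClass.parseFrom(new LimitedInputStream(in, recordSize));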
>>
>> Honestly, I don't understand how your stuff is working now.  Why is the
>> parser not throwing an exception when it gets to the end of the message and
>> finds garbage?  How does it know to stop?  What do you mean by "I put an
>> End-Of-Record in the protocol"?  The protobuf encoding does not define any
>> "end-of-record" indicator.
>>
>>
>> On Fri, Feb 19, 2010 at 12:31 AM, Yang <teddyyyy...@gmail.com> wrote:
>>
>>> I found the issue; it has the same root cause as a previous issue I
>>> reported on this forum.
>>>
>>> Basically, PB assumes that it should stop only where the provided stream
>>> ends; otherwise it keeps on reading.
>>> In the last issue the buffer was too long and it read in further junk, so
>>> I put an End-Of-Record in the protocol.
>>>
>>> This time the issue is that I have a stream of 20 records, and the first
>>> time PB called refillBuffer(), all 20 records (140 bytes) were slurped in
>>> from the underlying stream and stored inside PB's internal buffer.  But
>>> PB only parses the first of them and leaves the rest to waste.
>>> Somehow, because of the interface between SequenceFile, its InputStream,
>>> and PB, SequenceFile decides to call parseFrom() again; this time PB's
>>> refillBuffer() reaches real EOF, readTag() returns 0, and it returns a
>>> record with no fields.
>>>
>>>
>>> Overall it comes down to the question of whether we should expect the
>>> underlying stream to give PB the exact buffer range to read every time,
>>> i.e. to hit EOF at exactly the end of each record.  If not, PB needs to
>>> keep the state of the underlying stream, instead of creating a new
>>> CodedInputStream every time parseFrom() is called.  On the other hand,
>>> you could argue that defining record boundaries is the job of the
>>> underlying stream/FileFormat implementation, and that SequenceFile should
>>> do a better job of this.
>>>
>>> For now, for my current project, I just hacked AbstractMessageLite to use
>>> a persistent CodedInputStream, and it worked.
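>>>
>>> For reference, the standard building blocks for keeping parser state
>>> across records (the same machinery parseDelimitedFrom() uses) look
>>> roughly like this.  This is just a sketch assuming a varint size prefix
>>> on each record, not my actual hack:
>>>
>>> import com.google.protobuf.CodedInputStream;
>>>
>>> // One CodedInputStream for the life of the stream, so bytes it buffered
>>> // past the first record are not lost between records.  rawStream is
>>> // whatever InputStream underlies the file.
>>> CodedInputStream coded = CodedInputStream.newInstance(rawStream);
>>> while (!coded.isAtEnd()) {
>>>   int size = coded.readRawVarint32();    // size prefix of the next record
>>>   int oldLimit = coded.pushLimit(size);  // parser sees EOF at the boundary
>>>   MyPBClass msg = MyPBClass.newBuilder().mergeFrom(coded).build();
>>>   coded.popLimit(oldLimit);
>>>   // ... process msg ...
>>> }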
>>>
>>>
>>> On Thu, Feb 18, 2010 at 1:14 PM, Christopher Smith <cbsm...@gmail.com> wrote:
>>>
>>>> Is this a case of needing to delimit the input? I'm not familiar with
>>>> SplitterInputStream, but I'm wondering if it does the right thing for this
>>>> to work.
>>>>
>>>> --Chris
>>>>
>>>>
>>>> On Thu, Feb 18, 2010 at 12:56 PM, Kenton Varda <ken...@google.com> wrote:
>>>>
>>>>> Please reply-all so the mailing list stays CC'd.  I don't know anything
>>>>> about the libraries you are using so I can't really help you further.  
>>>>> Maybe
>>>>> someone else can.
>>>>>
>>>>> On Thu, Feb 18, 2010 at 12:46 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Kenton,
>>>>>>
>>>>>> I thought about the same thing.
>>>>>> What I did was use a splitter stream to split the actual input stream
>>>>>> in two, dumping one copy out for debugging and feeding the other to
>>>>>> PB.
>>>>>>
>>>>>>
>>>>>> My code for Hadoop is:
>>>>>>
>>>>>> // Hadoop Writable hook: deserialize one record from the DataInput.
>>>>>> public void readFields(DataInput in) throws IOException {
>>>>>>     // SplitterInputStream (my own wrapper) tees the bytes it reads to
>>>>>>     // a debug dump while passing them through to PB.
>>>>>>     SplitterInputStream ios = new SplitterInputStream(in);
>>>>>>     pb_object = MyPBClass.parseFrom(ios);
>>>>>> }
>>>>>>
>>>>>> SplitterInputStream dumps out the actual bytes, and the resulting byte
>>>>>> stream is indeed (decimal)
>>>>>>
>>>>>> 10 2 79 79  16 1  ... repeating 20 times
>>>>>>
>>>>>> which is 20 records of
>>>>>>
>>>>>> message MyPBClass {
>>>>>>   required string name = 1;  // taking a value of "yy"
>>>>>>   required int32 id = 2;     // taking a value of 1
>>>>>> }
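>>>>>>
>>>>>> (Decoding one record: 10 is the tag for field 1 with wire type 2
>>>>>> (length-delimited), 2 is the string length, 79 79 are the two string
>>>>>> bytes, 16 is the tag for field 2 with wire type 0 (varint), and 1 is
>>>>>> the value.)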
>>>>>>
>>>>>>
>>>>>>
>>>>>> Indeed, in both compressed and uncompressed mode, the dumped-out byte
>>>>>> stream is the same.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 18, 2010 at 12:03 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>>
>>>>>>> You should verify that the bytes that come out of the InputStream
>>>>>>> really are the exact same bytes that were written by the serializer to 
>>>>>>> the
>>>>>>> OutputStream originally.  You could do this by computing a checksum at 
>>>>>>> both
>>>>>>> ends and printing it, then inspecting visually.  You'll probably find 
>>>>>>> that
>>>>>>> the bytes differ somehow, or don't end at the same point.
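>>>>>>>
>>>>>>> For example, something like this on each side (just a sketch; capture
>>>>>>> the bytes into an array however is convenient in your pipeline):
>>>>>>>
>>>>>>> import java.util.zip.CRC32;
>>>>>>>
>>>>>>> // Print a checksum and length for the exact bytes written or read,
>>>>>>> // then compare the two printouts by eye.
>>>>>>> static String checksum(byte[] data) {
>>>>>>>   CRC32 crc = new CRC32();
>>>>>>>   crc.update(data);
>>>>>>>   return "len=" + data.length + " crc=" + Long.toHexString(crc.getValue());
>>>>>>> }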
>>>>>>>
>>>>>>> On Thu, Feb 18, 2010 at 2:48 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I tried to use protocol buffers in Hadoop.
>>>>>>>>
>>>>>>>> So far it works fine with SequenceFile, after I hooked it up with a
>>>>>>>> simple wrapper.
>>>>>>>>
>>>>>>>> But after I put a compressor into SequenceFile, it fails: it reads
>>>>>>>> all the messages and yet still wants to advance the read pointer, and
>>>>>>>> then readTag() returns 0, so mergeFrom() returns a message with no
>>>>>>>> fields set.
>>>>>>>>
>>>>>>>> Is anybody familiar with both SequenceFile and protocol buffers who
>>>>>>>> has an idea why it fails like this?
>>>>>>>> I find it difficult to understand, because the InputStream is simply
>>>>>>>> the same whether it comes through a compressor or not.
>>>>>>>>
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Yang
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chris
>>>>
>>>
>>
>>
>
