For your last comment: yes, the end-of-record indicator was another hack I
put in.

But both of your options above ultimately require the underlying stream to
provide exact record boundaries.
As I pointed out in the last email, that may or may not be a valid
requirement for the underlying InputStream, since only by looking at the
byte stream itself can we figure out where a record ends. But when we write
the stream, the underlying stream could insert some markers itself and use
them later.
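For illustration, one way the writer could insert such markers itself is a length-prefix framing: each record is preceded by its byte length, so the reader can hand PB exactly one record's worth of bytes. This is only a sketch; the names `writeRecord`/`readRecord` are made up, and the final `parseFrom` call is shown as a comment since it depends on the generated class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixFraming {
    // Write one record: a 4-byte length followed by the serialized bytes.
    static void writeRecord(DataOutputStream out, byte[] serialized) throws IOException {
        out.writeInt(serialized.length);
        out.write(serialized);
    }

    // Read one record: the length prefix tells us exactly how many bytes
    // belong to this record, so we never over-read into the next one.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;  // then: MyPBClass.parseFrom(buf)
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Two 6-byte records back to back, as in the dump later in this thread.
        writeRecord(out, new byte[] {10, 2, 79, 79, 16, 1});
        writeRecord(out, new byte[] {10, 2, 79, 79, 16, 1});

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        byte[] first = readRecord(in);
        byte[] second = readRecord(in);
        System.out.println(first.length + " " + second.length);
    }
}
```

With the framing in place, PB never sees bytes from the next record, so no end-of-record hack is needed inside the protocol itself.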

On Fri, Feb 19, 2010 at 2:16 AM, Kenton Varda <ken...@google.com> wrote:

> Two options:
>
> 1) Do not use parseFrom(InputStream).  Use parseFrom(byte[]).  Read the
> byte array from the stream yourself, so you can make sure to read only the
> correct number of bytes.
>
> 2) Create a FilterInputStream subclass which limits reading to some
> number of bytes.  Wrap your InputStream in that, then pass the wrapper to
> parseFrom(InputStream).  Then it cannot accidentally read too much.
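Option 2 might look something like the following minimal sketch. Only `java.io.FilterInputStream` is a real library class here; `BoundedInputStream` and everything else is hypothetical, written for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// A FilterInputStream subclass that reports EOF after `limit` bytes,
// so a parser reading from it cannot run past one record's boundary.
public class BoundedInputStream extends FilterInputStream {
    private int remaining;

    public BoundedInputStream(InputStream in, int limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;  // synthetic EOF at the limit
        int b = super.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) return -1;
        int n = super.read(buf, off, Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    public static void main(String[] args) throws IOException {
        // Two 6-byte records back to back; the wrapper exposes only the first.
        byte[] twoRecords = {10, 2, 79, 79, 16, 1, 10, 2, 79, 79, 16, 1};
        InputStream wrapped =
                new BoundedInputStream(new ByteArrayInputStream(twoRecords), 6);
        int count = 0;
        while (wrapped.read() >= 0) count++;  // stops at the 6-byte limit
        System.out.println(count);
    }
}
```

The wrapper would then be passed to parseFrom(InputStream) in place of the raw stream, one wrapper per record.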
>
> Honestly, I don't understand how your stuff is working now.  Why is the
> parser not throwing an exception when it gets to the end of the message and
> finds garbage?  How does it know to stop?  What do you mean by "I put an
> end-of-record to the protocol"?  The protobuf encoding does not define any
> "end-of-record" indicator.
>
>
> On Fri, Feb 19, 2010 at 12:31 AM, Yang <teddyyyy...@gmail.com> wrote:
>
>> I found the issue, this has the same root cause as a previous issue I
>> reported on this forum.
>>
>> Basically, I think PB assumes that it stops only where the provided stream
>> has ended; otherwise it keeps on reading.
>> In the last issue the buffer was too long and PB read in further junk, so
>> I put an end-of-record marker into the protocol.
>>
>> This time the issue is that I have a stream of 20 records. The first time
>> PB called refillBuffer(), all 20 records (140 bytes) were slurped in from
>> the underlying stream and stored inside PB's internal buffer, but PB only
>> parses the first of them and leaves the rest to waste.
>> Then, because of how SequenceFile and its InputStream interface with PB,
>> SequenceFile decides to call parseFrom() again; this time PB's
>> refillBuffer() reaches the real EOF, readTag() returns 0, and PB returns a
>> record with no fields.
>>
>>
>> Overall it comes down to the question of whether we should expect the
>> underlying stream to give PB the exact buffer range to read every time,
>> i.e. to signal EOF at exactly the end of each record. If not, PB needs to
>> keep the state of the underlying stream, instead of creating a new
>> CodedInputStream every time parseFrom() is called. On the other hand, you
>> could argue that defining record boundaries is the job of the underlying
>> stream/FileFormat implementation, and that SequenceFile should do a better
>> job of it.
>>
>>
>> For now, for my current project, I just hacked AbstractMessageLite to use
>> a persistent CodedInputStream, and it worked.
>>
>>
>> On Thu, Feb 18, 2010 at 1:14 PM, Christopher Smith <cbsm...@gmail.com> wrote:
>>
>>> Is this a case of needing to delimit the input? I'm not familiar with
>>> SplitterInputStream, but I'm wondering if it does the right thing for this
>>> to work.
>>>
>>> --Chris
>>>
>>>
>>> On Thu, Feb 18, 2010 at 12:56 PM, Kenton Varda <ken...@google.com> wrote:
>>>
>>>> Please reply-all so the mailing list stays CC'd.  I don't know anything
>>>> about the libraries you are using so I can't really help you further.  
>>>> Maybe
>>>> someone else can.
>>>>
>>>> On Thu, Feb 18, 2010 at 12:46 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>
>>>>> Thanks Kenton,
>>>>>
>>>>> I thought about the same thing.
>>>>> What I did was use a splitter stream: I split the actual input stream
>>>>> into two, dumping one copy out for debugging and feeding the other one
>>>>> to PB.
>>>>>
>>>>>
>>>>> My code for Hadoop is
>>>>>
>>>>> public void readFields(DataInput in) throws IOException {  // implements Writable
>>>>>
>>>>>     SplitterInputStream ios = new SplitterInputStream(in);
>>>>>
>>>>>     pb_object = MyPBClass.parseFrom(ios);
>>>>> }
>>>>>
>>>>> SplitterInputStream dumps out the actual bytes, and the resulting byte
>>>>> stream is indeed (decimal)
>>>>>
>>>>> 10 2 79 79 16 1  ... repeating 20 times
>>>>>
>>>>> which is 20 records of
>>>>>
>>>>> message {
>>>>>   1: string name;  // taking a value of "yy"
>>>>>   2: i32 Id;       // taking a value of 1
>>>>> }
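As a sanity check on that dump, the protobuf wire format makes each record self-describing up to field numbers and wire types: each tag byte encodes (field_number << 3) | wire_type. A small stand-alone decoder, not part of any library, just written here to illustrate how those six bytes break down:

```java
public class TagDecode {
    public static void main(String[] args) {
        int[] dump = {10, 2, 79, 79, 16, 1};  // one record from the dump
        // First tag: 10 = (1 << 3) | 2 -> field 1, wire type 2 (length-delimited)
        int tag1 = dump[0];
        System.out.println("field=" + (tag1 >> 3) + " wiretype=" + (tag1 & 7));
        int len = dump[1];                    // 2 payload bytes follow
        // Second tag: 16 = (2 << 3) | 0 -> field 2, wire type 0 (varint)
        int tag2 = dump[2 + len];
        System.out.println("field=" + (tag2 >> 3) + " wiretype=" + (tag2 & 7));
        System.out.println("value=" + dump[2 + len + 1]);
    }
}
```

So the dump is consistent with a two-field message: field 1 length-delimited (the 2-byte string) and field 2 a varint with value 1.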
>>>>>
>>>>>
>>>>>
>>>>> Indeed, in both compression and non-compression mode, the dumped-out
>>>>> byte stream is the same.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 18, 2010 at 12:03 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>
>>>>>> You should verify that the bytes that come out of the InputStream
>>>>>> really are the exact same bytes that were written by the serializer to 
>>>>>> the
>>>>>> OutputStream originally.  You could do this by computing a checksum at 
>>>>>> both
>>>>>> ends and printing it, then inspecting visually.  You'll probably find 
>>>>>> that
>>>>>> the bytes differ somehow, or don't end at the same point.
>>>>>>
>>>>>> On Thu, Feb 18, 2010 at 2:48 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>
>>>>>>> I tried to use Protocol Buffers in Hadoop.
>>>>>>>
>>>>>>> So far it works fine with SequenceFile, after I hooked it up with a
>>>>>>> simple wrapper.
>>>>>>>
>>>>>>> But after I put a compressor into SequenceFile, it fails, because PB
>>>>>>> reads all the messages and yet still wants to advance the read
>>>>>>> pointer; then readTag() returns 0, so mergeFrom() returns a message
>>>>>>> with no fields set.
>>>>>>>
>>>>>>> Is anybody familiar with both SequenceFile and Protocol Buffers who
>>>>>>> has an idea why it fails like this? I find it difficult to
>>>>>>> understand, because the InputStream is simply the same whether it
>>>>>>> comes through a compressor or not.
>>>>>>>
>>>>>>>
>>>>>>> thanks
>>>>>>> Yang
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Protocol Buffers" group.
>>>>>>> To post to this group, send email to proto...@googlegroups.com.
>>>>>>> To unsubscribe from this group, send email to
>>>>>>> protobuf+unsubscr...@googlegroups.com<protobuf%2bunsubscr...@googlegroups.com>
>>>>>>> .
>>>>>>> For more options, visit this group at
>>>>>>> http://groups.google.com/group/protobuf?hl=en.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>> Chris
>>>
>>
>
>

