Thanks, parseDelimitedFrom()/writeDelimitedTo() is exactly what I needed. I see
that the LimitedInputStream underneath restricts how much can be read from the
original stream, so we do not need to reuse stream state across calls.
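For the archives, the delimited write/read pairing looks roughly like this
(untested sketch; MyPBClass stands for the generated message class from my
earlier mails):

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class DelimitedExample {
  public static void main(String[] args) throws Exception {
    // Write: each writeDelimitedTo() call length-prefixes one message,
    // so record boundaries survive even an undelimited byte stream.
    try (FileOutputStream out = new FileOutputStream("records.bin")) {
      for (int i = 0; i < 20; i++) {
        MyPBClass.newBuilder().setName("yy").setId(1).build()
            .writeDelimitedTo(out);
      }
    }
    // Read: parseDelimitedFrom() consumes exactly one length-prefixed
    // message per call and returns null at a clean end-of-stream.
    try (FileInputStream in = new FileInputStream("records.bin")) {
      MyPBClass msg;
      while ((msg = MyPBClass.parseDelimitedFrom(in)) != null) {
        System.out.println(msg.getName() + " " + msg.getId());
      }
    }
  }
}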
On Fri, Feb 19, 2010 at 11:53 AM, Kenton Varda <ken...@google.com> wrote:
> If the underlying stream does not provide its own boundaries, then you
> need to prefix each protocol message with a size. Hacking an
> end-of-record "feature" into the protobuf code is probably not a good
> idea. We already provide parseDelimitedFrom()/writeDelimitedTo(), which
> prefix the message with a size. If you were to use the features we
> provide rather than invent your own, you wouldn't have these problems.
>
> But it is very surprising to me that the library you are using does not
> provide its own delimitation. What would happen if your records were
> just blobs of text instead of protocol buffers? Blobs of text do not
> delimit themselves. Even XML and JSON can have arbitrary amounts of
> whitespace at the end. Why do you expect protocol buffers to
> self-delimit when no other format does?
>
> On Fri, Feb 19, 2010 at 11:01 AM, Yang <teddyyyy...@gmail.com> wrote:
>> Regarding your last comment: yes, the end-of-record indicator was
>> another hack I put in.
>>
>> But both of your options ultimately require the underlying stream to
>> provide exact record boundaries. As I pointed out in my last email,
>> that may or may not be a valid requirement for the underlying
>> InputStream, since only by looking at the byte stream itself can we
>> figure out where a record ends. When we write the stream, though, the
>> underlying stream could insert markers of its own and make use of them
>> later.
>>
>> On Fri, Feb 19, 2010 at 2:16 AM, Kenton Varda <ken...@google.com> wrote:
>>> Two options:
>>>
>>> 1) Do not use parseFrom(InputStream); use parseFrom(byte[]). Read the
>>> byte array from the stream yourself, so you can make sure to read
>>> only the correct number of bytes.
>>>
>>> 2) Create a FilterInputStream subclass which limits reading to some
>>> number of bytes. Wrap your InputStream in that, then pass the wrapper
>>> to parseFrom(InputStream). Then it cannot accidentally read too much.
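>>> A minimal sketch of option 2 (untested; the class name and the
>>> recordLength variable are illustrative):
>>>
>>> import java.io.FilterInputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>>
>>> // Caps how many bytes the parser may consume from the wrapped
>>> // stream, reporting EOF at the record boundary.
>>> class BoundedInputStream extends FilterInputStream {
>>>   private int remaining;
>>>
>>>   BoundedInputStream(InputStream in, int limit) {
>>>     super(in);
>>>     this.remaining = limit;
>>>   }
>>>
>>>   @Override public int read() throws IOException {
>>>     if (remaining <= 0) return -1;  // synthetic EOF at the boundary
>>>     int b = in.read();
>>>     if (b >= 0) remaining--;
>>>     return b;
>>>   }
>>>
>>>   @Override public int read(byte[] buf, int off, int len) throws IOException {
>>>     if (remaining <= 0) return -1;
>>>     int n = in.read(buf, off, Math.min(len, remaining));
>>>     if (n > 0) remaining -= n;
>>>     return n;
>>>   }
>>> }
>>>
>>> // usage: MyPBClass.parseFrom(new BoundedInputStream(rawIn, recordLength));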
>>> Honestly, I don't understand how your stuff is working now. Why does
>>> the parser not throw an exception when it gets to the end of the
>>> message and finds garbage? How does it know where to stop? And what
>>> do you mean by "I put an End-Of-Record into the protocol"? The
>>> protobuf encoding does not define any "end-of-record" indicator.
>>>
>>> On Fri, Feb 19, 2010 at 12:31 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>> I found the issue; it has the same root cause as a previous issue I
>>>> reported on this forum.
>>>>
>>>> Basically, PB assumes that it stops only where the provided stream
>>>> has ended; otherwise it keeps on reading. In the previous issue the
>>>> buffer was too long and the parser read in further junk, so I put an
>>>> End-Of-Record marker into the protocol.
>>>>
>>>> This time the issue is that I have a stream of 20 records, and the
>>>> first time PB called refillBuffer(), all 20 records (140 bytes) were
>>>> slurped in from the underlying stream and stored inside PB's
>>>> internal buffer. But PB only parses the first of them and leaves the
>>>> rest to waste. Because of how SequenceFile and its InputStream
>>>> interface with PB, SequenceFile decides to call parseFrom() again;
>>>> this time refillBuffer() reaches the real EOF, readTag() returns 0,
>>>> and the parse returns a record with no fields set.
>>>>
>>>> Overall it comes down to the question of whether we should expect
>>>> the underlying stream to give PB the exact buffer range to read
>>>> every time, i.e. to report EOF at exactly the end of a record. If
>>>> not, PB needs to keep the state of the underlying stream, instead of
>>>> creating a new CodedInputStream every time parseFrom() is called. On
>>>> the other hand, you could argue that defining record boundaries is
>>>> the job of the underlying stream/FileFormat implementation, and that
>>>> SequenceFile should do a better job of it.
>>>>
>>>> For now, for my current project, I just hacked AbstractMessageLite
>>>> to use a persistent CodedInputStream, and it worked.
>>>>
>>>> On Thu, Feb 18, 2010 at 1:14 PM, Christopher Smith <cbsm...@gmail.com> wrote:
>>>>> Is this a case of needing to delimit the input? I'm not familiar
>>>>> with SplitterInputStream, but I'm wondering if it does the right
>>>>> thing for this to work.
>>>>>
>>>>> --Chris
>>>>>
>>>>> On Thu, Feb 18, 2010 at 12:56 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>> Please reply-all so the mailing list stays CC'd. I don't know
>>>>>> anything about the libraries you are using, so I can't really help
>>>>>> you further. Maybe someone else can.
>>>>>>
>>>>>> On Thu, Feb 18, 2010 at 12:46 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>> Thanks, Kenton.
>>>>>>>
>>>>>>> I had the same thought, so I used a splitter stream to split the
>>>>>>> actual input stream into two, dumping one copy out for debugging
>>>>>>> and feeding the other to PB.
>>>>>>>
>>>>>>> My code for Hadoop is:
>>>>>>>
>>>>>>> public void readFields(DataInput in) throws IOException {
>>>>>>>   // SplitterInputStream tees the bytes off for debugging while
>>>>>>>   // passing them through to the protobuf parser.
>>>>>>>   SplitterInputStream ios = new SplitterInputStream(in);
>>>>>>>   pb_object = MyPBClass.parseFrom(ios);
>>>>>>> }
>>>>>>>
>>>>>>> SplitterInputStream dumps out the actual bytes, and the resulting
>>>>>>> byte stream is indeed (decimal)
>>>>>>>
>>>>>>> 10 2 79 79 16 1 ... repeating 20 times
>>>>>>>
>>>>>>> which is 20 records of
>>>>>>>
>>>>>>> message MyPBClass {
>>>>>>>   optional string name = 1;  // taking the value "yy"
>>>>>>>   optional int32 Id = 2;     // taking the value 1
>>>>>>> }
>>>>>>>
>>>>>>> And indeed, in both compression and non-compression mode, the
>>>>>>> dumped byte stream is the same.
>>>>>>>
>>>>>>> On Thu, Feb 18, 2010 at 12:03 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>>>> You should verify that the bytes that come out of the
>>>>>>>> InputStream really are the exact same bytes that were written by
>>>>>>>> the serializer to the OutputStream originally. You could do this
>>>>>>>> by computing a checksum at both ends and printing it, then
>>>>>>>> inspecting visually. You'll probably find that the bytes differ
>>>>>>>> somehow, or don't end at the same point.
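>>>>>>>> For example (untested sketch):
>>>>>>>>
>>>>>>>> import java.util.zip.CRC32;
>>>>>>>>
>>>>>>>> // Fingerprints a byte array so the writer and reader sides can
>>>>>>>> // be compared by eye.
>>>>>>>> class ChecksumDebug {
>>>>>>>>   static void print(String label, byte[] data) {
>>>>>>>>     CRC32 crc = new CRC32();
>>>>>>>>     crc.update(data);
>>>>>>>>     System.out.println(label + ": len=" + data.length
>>>>>>>>         + " crc32=" + Long.toHexString(crc.getValue()));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> // writer side: ChecksumDebug.print("written", msg.toByteArray());
>>>>>>>> // reader side: buffer the bytes you hand to parseFrom() and call
>>>>>>>> // ChecksumDebug.print("read", bytes) before parsing.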
>>>>>>>> On Thu, Feb 18, 2010 at 2:48 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>>>> I tried to use protocol buffers in Hadoop. So far it works fine
>>>>>>>>> with SequenceFile, after I hooked it up with a simple wrapper.
>>>>>>>>>
>>>>>>>>> But after I put a compressor into SequenceFile, it fails: the
>>>>>>>>> parser reads all the messages and yet still wants to advance
>>>>>>>>> the read pointer, and then readTag() returns 0, so mergeFrom()
>>>>>>>>> returns a message with no fields set.
>>>>>>>>>
>>>>>>>>> Is anybody who is familiar with both SequenceFile and protocol
>>>>>>>>> buffers able to say why it fails like this? I find it difficult
>>>>>>>>> to understand, because the InputStream is simply the same
>>>>>>>>> whether or not it comes through a compressor.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang