I found the issue; it has the same root cause as a previous issue I reported on this forum.

Basically, PB assumes that it should stop only where the provided stream ends; otherwise it keeps reading. In the last issue the buffer was too long and it read in further junk, so I added an End-of-Record marker to the protocol. This time the issue is that I have a stream of 20 records, and the first time PB called refillBuffer(), all 20 records (140 bytes) were slurped in from the underlying stream and stored inside PB's internal buffer. But PB only parses the first of them and leaves the rest to waste. Due to the interface between SequenceFile and the InputStream it hands to PB, SequenceFile decides to call parseFrom() again; this time PB's refillBuffer() reaches real EOF, readTag() returns 0, and a record with no fields set is returned.

Overall it comes down to the question of whether we should expect the underlying stream to give PB the exact byte range to read every time, i.e. to signal EOF at exactly the end of a record. If not, PB needs to keep the state of the underlying stream instead of creating a new CodedInputStream every time parseFrom() is called. On the other hand, you could argue that defining record boundaries is the job of the underlying stream/FileFormat implementation, and that SequenceFile should do a better job at this. For now, for my current project, I just hacked AbstractMessageLite to use a persistent CodedInputStream, and it worked.

On Thu, Feb 18, 2010 at 1:14 PM, Christopher Smith <[email protected]> wrote:

> Is this a case of needing to delimit the input? I'm not familiar with
> SplitterInputStream, but I'm wondering if it does the right thing for this
> to work.
>
> --Chris
>
> On Thu, Feb 18, 2010 at 12:56 PM, Kenton Varda <[email protected]> wrote:
>
>> Please reply-all so the mailing list stays CC'd. I don't know anything
>> about the libraries you are using so I can't really help you further. Maybe
>> someone else can.
>>
>> On Thu, Feb 18, 2010 at 12:46 PM, Yang <[email protected]> wrote:
>>
>>> thanks Kenton,
>>>
>>> I thought about the same. What I did was use a splitter stream to split
>>> the actual input stream into two, dumping out one for debugging and
>>> feeding the other one to PB.
>>>
>>> My code for Hadoop is:
>>>
>>> Writable.readFields(DataInput in) {
>>>     SplitterInputStream ios = new SplitterInputStream(in);
>>>     pb_object = MyPBClass.parseFrom(ios);
>>> }
>>>
>>> SplitterInputStream dumps out the actual bytes, and the resulting byte
>>> stream is indeed (decimal):
>>>
>>> 10 2 79 79 16 1 ... repeating 20 times
>>>
>>> which is 20 records of:
>>>
>>> message {
>>>     1: string name; // taking a value of "yy"
>>>     2: i32 Id;      // taking a value of 1
>>> }
>>>
>>> Indeed, in both compression and non-compression mode, the dumped-out
>>> byte stream is the same.
>>>
>>> On Thu, Feb 18, 2010 at 12:03 PM, Kenton Varda <[email protected]> wrote:
>>>
>>>> You should verify that the bytes that come out of the InputStream really
>>>> are the exact same bytes that were written by the serializer to the
>>>> OutputStream originally. You could do this by computing a checksum at both
>>>> ends and printing it, then inspecting visually. You'll probably find that
>>>> the bytes differ somehow, or don't end at the same point.
>>>>
>>>> On Thu, Feb 18, 2010 at 2:48 AM, Yang <[email protected]> wrote:
>>>>
>>>>> I tried to use protocol buffers in Hadoop. So far it works fine with
>>>>> SequenceFile, after I hooked it up with a simple wrapper, but after I
>>>>> put a compressor into SequenceFile, it fails: it reads all the messages
>>>>> and yet still wants to advance the read pointer, and then readTag()
>>>>> returns 0, so mergeFrom() returns a message with no fields set.
>>>>>
>>>>> Is anybody familiar with both SequenceFile and protocol buffers who has
>>>>> an idea why it fails like this?
>>>>> I find it difficult to understand because the InputStream is simply the
>>>>> same, whether it comes through a compressor or not.
>>>>>
>>>>> thanks
>>>>> Yang
>
> --
> Chris

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
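[Editor's note: the record-boundary problem discussed in this thread is what length-delimited framing solves. protobuf's Java API provides writeDelimitedTo()/parseDelimitedFrom() for exactly this, using a varint length prefix. The sketch below illustrates the same idea with a plain 4-byte prefix and stdlib-only code, so no protobuf dependency is needed; the byte values and class name are illustrative, not from the thread's actual code.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// FramingDemo: each record is written with a 4-byte length prefix, so the
// reader knows exactly where one record ends and the next begins -- the
// record-boundary job that, per the thread, neither PB nor SequenceFile was
// doing. (protobuf's writeDelimitedTo()/parseDelimitedFrom() use a varint
// prefix for the same purpose.)
public class FramingDemo {
    static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);  // length prefix
        out.write(record);            // payload
    }

    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();       // read exactly one record's length
        byte[] buf = new byte[len];
        in.readFully(buf);            // never read past the record boundary
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // Write 20 identical small "records" into one stream, as in the thread.
        byte[] record = {10, 2, 79, 79, 16, 1};
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < 20; i++) writeRecord(out, record);

        // Read them back one at a time; each read stops at the record boundary,
        // so a second parse never sees a spurious EOF or a zero readTag().
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        int count = 0;
        while (in.available() > 0) {
            byte[] r = readRecord(in);
            if (!Arrays.equals(r, record)) throw new AssertionError("corrupt record");
            count++;
        }
        System.out.println("records read: " + count);  // prints "records read: 20"
    }
}
```

With framing, the stream hands the parser exactly one record's bytes per call, so no persistent CodedInputStream hack is needed.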
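[Editor's note: Kenton's debugging suggestion, comparing a checksum of the bytes at both ends of the pipeline, can be done with java.util.zip.CRC32. A minimal sketch; the sample byte arrays are placeholders for whatever the writer and reader sides actually see.]

```java
import java.util.zip.CRC32;

// ChecksumDemo: compute a CRC32 over the bytes on the writer side and over
// the bytes the reader's InputStream yields, then compare. If the values
// differ, the stream is altering or truncating the bytes in between.
public class ChecksumDemo {
    static long checksum(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] written = {10, 2, 79, 79, 16, 1};   // bytes the serializer produced
        byte[] readBack = {10, 2, 79, 79, 16, 1};  // bytes the InputStream yielded
        long a = checksum(written), b = checksum(readBack);
        System.out.println(a == b ? "checksums match"
                                  : "checksums differ: " + a + " vs " + b);
    }
}
```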
