Thanks, parseDelimitedFrom()/writeDelimitedTo() is exactly what I needed. I see
that the LimitedInputStream underneath restricts how much can be read from the
original stream, so we do not need to reuse stream state across calls.
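For the archives, the delimited write/read pairing looks roughly like this
(untested sketch; MyPBClass stands for the generated message class from my
earlier mails):

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class DelimitedExample {
  public static void main(String[] args) throws Exception {
    // Write: each writeDelimitedTo() call length-prefixes one message,
    // so record boundaries survive even an undelimited byte stream.
    try (FileOutputStream out = new FileOutputStream("records.bin")) {
      for (int i = 0; i < 20; i++) {
        MyPBClass.newBuilder().setName("yy").setId(1).build()
            .writeDelimitedTo(out);
      }
    }
    // Read: parseDelimitedFrom() consumes exactly one length-prefixed
    // message per call and returns null at a clean end-of-stream.
    try (FileInputStream in = new FileInputStream("records.bin")) {
      MyPBClass msg;
      while ((msg = MyPBClass.parseDelimitedFrom(in)) != null) {
        System.out.println(msg.getName() + " " + msg.getId());
      }
    }
  }
}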
On Fri, Feb 19, 2010 at 11:53 AM, Kenton Varda <ken...@google.com> wrote:
> If the underlying stream does not provide its own boundaries, then you
> need to prefix each protocol message with a size. Hacking an
> end-of-record "feature" into the protobuf code is probably not a good
> idea. We already provide parseDelimitedFrom()/writeDelimitedTo(), which
> prefix the message with a size. If you were to use the features we
> provide rather than invent your own, you wouldn't have these problems.
>
> But it is very surprising to me that the library you are using does not
> provide its own delimitation. What would happen if your records were
> just blobs of text instead of protocol buffers? Blobs of text do not
> delimit themselves. Even XML and JSON can have arbitrary amounts of
> whitespace at the end. Why do you expect protocol buffers to
> self-delimit when no other format does?
>
> On Fri, Feb 19, 2010 at 11:01 AM, Yang <teddyyyy...@gmail.com> wrote:
>> Regarding your last comment: yes, the end-of-record indicator was
>> another hack I put in.
>>
>> But both of your options ultimately require the underlying stream to
>> provide exact record boundaries. As I pointed out in my last email,
>> that may or may not be a valid requirement for the underlying
>> InputStream, since only by looking at the byte stream itself can we
>> figure out where a record ends. When we write the stream, though, the
>> underlying stream could insert markers of its own and make use of them
>> later.
>>
>> On Fri, Feb 19, 2010 at 2:16 AM, Kenton Varda <ken...@google.com> wrote:
>>> Two options:
>>>
>>> 1) Do not use parseFrom(InputStream); use parseFrom(byte[]). Read the
>>> byte array from the stream yourself, so you can make sure to read
>>> only the correct number of bytes.
>>>
>>> 2) Create a FilterInputStream subclass which limits reading to some
>>> number of bytes. Wrap your InputStream in that, then pass the wrapper
>>> to parseFrom(InputStream). Then it cannot accidentally read too much.
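>>> A minimal sketch of option 2 (untested; the class name and the
>>> recordLength variable are illustrative):
>>>
>>> import java.io.FilterInputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>>
>>> // Caps how many bytes the parser may consume from the wrapped
>>> // stream, reporting EOF at the record boundary.
>>> class BoundedInputStream extends FilterInputStream {
>>>   private int remaining;
>>>
>>>   BoundedInputStream(InputStream in, int limit) {
>>>     super(in);
>>>     this.remaining = limit;
>>>   }
>>>
>>>   @Override public int read() throws IOException {
>>>     if (remaining <= 0) return -1;  // synthetic EOF at the boundary
>>>     int b = in.read();
>>>     if (b >= 0) remaining--;
>>>     return b;
>>>   }
>>>
>>>   @Override public int read(byte[] buf, int off, int len) throws IOException {
>>>     if (remaining <= 0) return -1;
>>>     int n = in.read(buf, off, Math.min(len, remaining));
>>>     if (n > 0) remaining -= n;
>>>     return n;
>>>   }
>>> }
>>>
>>> // usage: MyPBClass.parseFrom(new BoundedInputStream(rawIn, recordLength));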
>>> Honestly, I don't understand how your stuff is working now. Why does
>>> the parser not throw an exception when it gets to the end of the
>>> message and finds garbage? How does it know where to stop? And what
>>> do you mean by "I put an End-Of-Record into the protocol"? The
>>> protobuf encoding does not define any "end-of-record" indicator.
>>>
>>> On Fri, Feb 19, 2010 at 12:31 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>> I found the issue; it has the same root cause as a previous issue I
>>>> reported on this forum.
>>>>
>>>> Basically, PB assumes that it stops only where the provided stream
>>>> has ended; otherwise it keeps on reading. In the previous issue the
>>>> buffer was too long and the parser read in further junk, so I put an
>>>> End-Of-Record marker into the protocol.
>>>>
>>>> This time the issue is that I have a stream of 20 records, and the
>>>> first time PB called refillBuffer(), all 20 records (140 bytes) were
>>>> slurped in from the underlying stream and stored inside PB's
>>>> internal buffer. But PB only parses the first of them and leaves the
>>>> rest to waste. Because of how SequenceFile and its InputStream
>>>> interface with PB, SequenceFile decides to call parseFrom() again;
>>>> this time refillBuffer() reaches the real EOF, readTag() returns 0,
>>>> and the parse returns a record with no fields set.
>>>>
>>>> Overall it comes down to the question of whether we should expect
>>>> the underlying stream to give PB the exact buffer range to read
>>>> every time, i.e. to report EOF at exactly the end of a record. If
>>>> not, PB needs to keep the state of the underlying stream, instead of
>>>> creating a new CodedInputStream every time parseFrom() is called. On
>>>> the other hand, you could argue that defining record boundaries is
>>>> the job of the underlying stream/FileFormat implementation, and that
>>>> SequenceFile should do a better job of it.
>>>>
>>>> For now, for my current project, I just hacked AbstractMessageLite
>>>> to use a persistent CodedInputStream, and it worked.
>>>>
>>>> On Thu, Feb 18, 2010 at 1:14 PM, Christopher Smith <cbsm...@gmail.com> wrote:
>>>>> Is this a case of needing to delimit the input? I'm not familiar
>>>>> with SplitterInputStream, but I'm wondering if it does the right
>>>>> thing for this to work.
>>>>>
>>>>> --Chris
>>>>>
>>>>> On Thu, Feb 18, 2010 at 12:56 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>> Please reply-all so the mailing list stays CC'd. I don't know
>>>>>> anything about the libraries you are using, so I can't really help
>>>>>> you further. Maybe someone else can.
>>>>>>
>>>>>> On Thu, Feb 18, 2010 at 12:46 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>> Thanks, Kenton.
>>>>>>>
>>>>>>> I had the same thought, so I used a splitter stream to split the
>>>>>>> actual input stream into two, dumping one copy out for debugging
>>>>>>> and feeding the other to PB.
>>>>>>>
>>>>>>> My code for Hadoop is:
>>>>>>>
>>>>>>> public void readFields(DataInput in) throws IOException {
>>>>>>>   // SplitterInputStream tees the bytes off for debugging while
>>>>>>>   // passing them through to the protobuf parser.
>>>>>>>   SplitterInputStream ios = new SplitterInputStream(in);
>>>>>>>   pb_object = MyPBClass.parseFrom(ios);
>>>>>>> }
>>>>>>>
>>>>>>> SplitterInputStream dumps out the actual bytes, and the resulting
>>>>>>> byte stream is indeed (decimal)
>>>>>>>
>>>>>>> 10 2 79 79 16 1 ... repeating 20 times
>>>>>>>
>>>>>>> which is 20 records of
>>>>>>>
>>>>>>> message MyPBClass {
>>>>>>>   optional string name = 1;  // taking the value "yy"
>>>>>>>   optional int32 Id = 2;     // taking the value 1
>>>>>>> }
>>>>>>>
>>>>>>> And indeed, in both compression and non-compression mode, the
>>>>>>> dumped byte stream is the same.
>>>>>>>
>>>>>>> On Thu, Feb 18, 2010 at 12:03 PM, Kenton Varda <ken...@google.com> wrote:
>>>>>>>> You should verify that the bytes that come out of the
>>>>>>>> InputStream really are the exact same bytes that were written by
>>>>>>>> the serializer to the OutputStream originally. You could do this
>>>>>>>> by computing a checksum at both ends and printing it, then
>>>>>>>> inspecting visually. You'll probably find that the bytes differ
>>>>>>>> somehow, or don't end at the same point.
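>>>>>>>> For example (untested sketch):
>>>>>>>>
>>>>>>>> import java.util.zip.CRC32;
>>>>>>>>
>>>>>>>> // Fingerprints a byte array so the writer and reader sides can
>>>>>>>> // be compared by eye.
>>>>>>>> class ChecksumDebug {
>>>>>>>>   static void print(String label, byte[] data) {
>>>>>>>>     CRC32 crc = new CRC32();
>>>>>>>>     crc.update(data);
>>>>>>>>     System.out.println(label + ": len=" + data.length
>>>>>>>>         + " crc32=" + Long.toHexString(crc.getValue()));
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> // writer side: ChecksumDebug.print("written", msg.toByteArray());
>>>>>>>> // reader side: buffer the bytes you hand to parseFrom() and call
>>>>>>>> // ChecksumDebug.print("read", bytes) before parsing.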
>>>>>>>> On Thu, Feb 18, 2010 at 2:48 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>>>>> I tried to use protocol buffers in Hadoop. So far it works fine
>>>>>>>>> with SequenceFile, after I hooked it up with a simple wrapper.
>>>>>>>>>
>>>>>>>>> But after I put a compressor into SequenceFile, it fails: the
>>>>>>>>> parser reads all the messages and yet still wants to advance
>>>>>>>>> the read pointer, and then readTag() returns 0, so mergeFrom()
>>>>>>>>> returns a message with no fields set.
>>>>>>>>>
>>>>>>>>> Is anybody who is familiar with both SequenceFile and protocol
>>>>>>>>> buffers able to say why it fails like this? I find it difficult
>>>>>>>>> to understand, because the InputStream is simply the same
>>>>>>>>> whether or not it comes through a compressor.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang