Hi,

On Tue, May 12, 2009 at 6:47 AM, <dan.schmidt.va...@gmail.com> wrote:
>
> I am having a very similar problem. Just installed the 2.0.3 version
> and now all my serialisations complain.
>
> libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
> string containing invalid UTF-8 data while parsing protocol buffer.
> Strings must contain only UTF-8; use the 'bytes' type for raw bytes.
>
> Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
> string is an array of those.

The Protocol Buffers 'bytes' type translates into std::string in C++, and
an array of chars is an array of bytes. So you're fine: declaring the
field as 'bytes' in the .proto file still gives you a std::string.

> So, what does it need? Would I be better
> off serialising to a stream like the CodedStream?
>
> I am very confused on the issue. I have the horrible feeling now that
> I'm losing efficiency because serialising to string might mean that
> I'm losing my raw data.
>
> Otherwise, then the word ERROR on the output might be a bit too
> strong.
>
> If anybody can clarify, I'd be very grateful.
>
> Dan
>
> On May 10, 5:59 pm, Henner Zeller <h.zel...@acm.org> wrote:
>> On Sun, May 10, 2009 at 6:08 AM, edan <edan...@gmail.com> wrote:
>> > I have some fields that may contain non-UTF8 data.
>> > I understand that I just need to change their type from "string" to "bytes"
>> > and it should just work, transparently.
>>
>> Yes. They're the same on the wire.
>>
>> > I have a few fields that probably will only contain ASCII i.e. legal UTF8,
>> > but I'm not 100% sure.
>> > I am tempted to just turn them all to "bytes".
>> > But this begs the question - what is the "string" type useful for, and why
>> > shouldn't I just always use "bytes" to be sure, all the time, and not bother
>> > with "string" at all?
>> > Does "string" add anything besides validation that only valid UTF8 is
>> > passing over the wire? Is there really a big benefit to this behavior? Or
>> > is there some other advantage that I'll miss out on by changing all my
>> > "string"s to "bytes"?
>>
>> If you use the C++ API there is not much difference since both types
>> are represented as std::string in the API. It makes a big difference
>> for the Java API (and Python?), which have a native type for Unicode
>> strings. In Java, if you deal with a protocol buffer 'string' type, the
>> generated API will return a java.lang.String while otherwise it will
>> return a ByteString. ByteString can hold arbitrary bytes while the
>> native Java String can only hold valid Unicode text.
>> So while 'ByteString' is more
>> flexible, 'String' is more convenient to deal with within Java code
>> because all string manipulation libraries can handle it.
>>
>> So the benefit is a more convenient API in the generated Java code,
>> and documentation as well: if you use 'string' you emphasize that a
>> field only contains readable text while 'bytes' might contain any
>> binary blob.
>>
>> -h
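
To make the C++ side concrete, here is a minimal sketch of the two points in this thread: a std::string keeps raw bytes intact (it stores a length, so embedded NULs and high bytes are fine), and a UTF-8 check will reject that kind of data. IsValidUtf8 and CheckExamples are hypothetical names made up for this illustration; this is not the check libprotobuf itself runs, just a plain validator showing which byte sequences would trigger a complaint like the one above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects overlong encodings, UTF-16 surrogates, and values above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;     // continuation bytes expected after the lead byte
    unsigned long cp = 0;      // code point being decoded
    unsigned long min_cp = 0;  // smallest code point allowed for this length
    if (c < 0x80) {
      cp = c;
    } else if ((c & 0xE0) == 0xC0) {
      extra = 1; cp = c & 0x1F; min_cp = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      extra = 2; cp = c & 0x0F; min_cp = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      extra = 3; cp = c & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i - 1 < extra) return false;  // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min_cp) return false;                    // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range
    if (cp > 0x10FFFF) return false;                  // beyond Unicode
    i += extra + 1;
  }
  return true;
}

// Hypothetical usage: raw bytes survive in std::string but are not UTF-8.
void CheckExamples() {
  const std::string raw("\x00\xff\xfe\x01", 4);  // explicit length keeps the NUL
  assert(raw.size() == 4);
  assert(static_cast<unsigned char>(raw[1]) == 0xff);
  assert(!IsValidUtf8(raw));            // 0xff never appears in UTF-8
  assert(IsValidUtf8("plain ASCII"));   // ASCII is always valid UTF-8
  assert(IsValidUtf8("caf\xc3\xa9"));   // "cafe" with e-acute, UTF-8 encoded
  assert(!IsValidUtf8("caf\xe9"));      // same word in Latin-1: invalid UTF-8
}
```

A field whose contents can look like `raw` or the Latin-1 example belongs in 'bytes'; a field that always passes a check of this kind is a reasonable candidate for 'string'.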