Re: string vs. bytes

dan . schmidt . valle Tue, 12 May 2009 10:43:14 -0700

Thanks very much for the answers guys. Most illustrative. The error
messages did in fact disappear with that simple change in all my proto
files.
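
For anyone following along, the check behind that error message comes down to whether the field's bytes are well-formed UTF-8. Below is a minimal sketch of such a check in plain C++. It is not the library's actual code, and IsValidUtf8 is a hypothetical name chosen for illustration. It only shows why arbitrary binary data in a "string" field trips the check while the same data in a "bytes" field is never inspected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of protobuf): returns true if `s` is
// well-formed UTF-8. Rejects stray or missing continuation bytes, overlong
// encodings, UTF-16 surrogates, and code points above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) {  // single-byte (ASCII) character
      ++i;
      continue;
    }
    std::size_t len = 0;        // total bytes in this sequence
    unsigned int cp = 0;        // decoded code point
    unsigned int min_cp = 0;    // smallest code point allowed for this length
    if ((c & 0xE0) == 0xC0) {
      len = 2; cp = c & 0x1F; min_cp = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3; cp = c & 0x0F; min_cp = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4; cp = c & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (i + len > n) return false;  // sequence truncated at end of input
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cc = static_cast<unsigned char>(s[i + k]);
      if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cc & 0x3F);
    }
    if (cp < min_cp) return false;                    // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate range
    if (cp > 0x10FFFF) return false;                  // beyond Unicode
    i += len;
  }
  return true;
}
```

Plain ASCII text passes this check, whereas a buffer such as the two bytes 0xFF 0xFE does not, which is the situation the "use the 'bytes' type for raw bytes" message describes.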


Still, now that this error has shown up in my code, I keep wondering
whether serialising to a string is inefficient. What would be the case
for serialising to a stream instead?
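
On the related worry about losing raw data: in C++ a std::string is a length-counted container of chars, so arbitrary bytes (including embedded NULs and values above 0x7F) should survive whether they sit in a string or pass through a stream. The sketch below does not use protobuf at all, and RoundTripThroughStream is a hypothetical helper. It only illustrates that neither container alters the bytes, and it says nothing about which approach is faster.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of protobuf): writes the bytes of `raw`
// into an in-memory binary stream and reads them back out as a std::string.
std::string RoundTripThroughStream(const std::string& raw) {
  std::ostringstream out(std::ios::binary);
  // write() copies exactly raw.size() bytes and does not stop at NUL.
  out.write(raw.data(), static_cast<std::streamsize>(raw.size()));
  return out.str();
}
```

When building test input with embedded NULs, construct the std::string with an explicit length, since the const char* constructor stops at the first NUL.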

Thanks again for the help.

Dan

On May 12, 5:26 pm, Kenton Varda <ken...@google.com> wrote:
> Protocol Buffers has a "bytes" type.  That's what it's talking about.  Just
> change "string" to "bytes" in your .proto file.  (They work exactly the same
> in C++ but are different in Java and Python.)
>
> On Tue, May 12, 2009 at 6:47 AM, <dan.schmidt.va...@gmail.com> wrote:
>
> > I am having a very similar problem. Just installed the 2.0.3 version
> > and now all my serialisations complain.
>
> > libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
> > string containing invalid UTF-8 data while parsing protocol buffer.
> > Strings must contain only UTF-8; use the 'bytes' type for raw bytes.
>
> > Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
> > string is an array of those. So, what does it need? Would I be better
> > off serialising to a stream like the CodedStream?
>
> > I am very confused on the issue. I have the horrible feeling now that
> > I'm losing efficiency because serialising to string might mean that
> > I'm losing my raw data.
>
> > Otherwise, then the word ERROR on the output might be a bit too
> > strong.
>
> > If anybody can clarify, I'd be very grateful.
>
> > Dan
>
> > On May 10, 5:59 pm, Henner Zeller <h.zel...@acm.org> wrote:
> > > On Sun, May 10, 2009 at 6:08 AM, edan <edan...@gmail.com> wrote:
> > > > I have some fields that may contain non-UTF8 data.
> > > > I understand that I just need to change their type from "string" to
> > > > "bytes" and it should just work, transparently.
>
> > > yes. They're the same on the wire.
>
> > > > I have a few fields that probably will only contain ASCII i.e. legal
> > > > UTF8, but I'm not 100% sure.
> > > > I am tempted to just turn them all to "bytes".
> > > > But this begs the question - what is the "string" type useful for, and
> > > > why shouldn't I just always use "bytes" to be sure, all the time, and
> > > > not bother with "string" at all?
> > > > Does "string" add anything besides validation that only valid UTF8 is
> > > > passing over the wire?  Is there really a big benefit to this
> > > > behavior?  Or is there some other advantage that I'll miss out on by
> > > > changing all my "string"s to "bytes"?
>
> > > If you use the C++ api there is not much difference since both types
> > > are represented as std::string in the API. It makes a big difference
> > > for the Java API (and Python?), that have a native type for an UTF-8
> > > string. In Java, if you deal with a protocol buffer 'string' type, the
> > > generated API will return a java.lang.String while otherwise it will
> > > return a ByteString. ByteString can hold arbitrary bytes while the
> > > native Java String holds only Unicode text. So while 'ByteString' is more
> > > flexible, 'String' is more convenient to deal with within Java code
> > > because all string manipulation libraries can handle it.
>
> > > So the benefit is a more convenient Api in the generated Java code.
> > > And as well documentation: if you use 'string' you emphasize that a
> > > field only contains readable text while 'bytes' might contain any
> > > binary blob.
>
> > > -h