Re: string vs. bytes

Kenton Varda Tue, 12 May 2009 09:27:27 -0700

Protocol Buffers has a "bytes" type; that's what the error message is
referring to.  Just change "string" to "bytes" in your .proto file.  (Both
are std::string in the C++ API, but they are different types in Java and
Python.)
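
As a rough illustration of what "invalid UTF-8 data" means, here is a minimal
sketch in plain Python. It does not use the protobuf library and is not
libprotobuf's actual check; the helper name is_valid_utf8 and the sample
payloads are made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Text encoded as UTF-8: acceptable content for a 'string' field.
text_payload = "héllo".encode("utf-8")

# Arbitrary binary data (0xFF can never appear in UTF-8): use 'bytes'.
raw_payload = bytes([0xFF, 0xFE, 0x00, 0x80])

print(is_valid_utf8(text_payload))
print(is_valid_utf8(raw_payload))
```

If a field can ever carry something like raw_payload, declaring it as "bytes"
sidesteps the check entirely.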

On Tue, May 12, 2009 at 6:47 AM, <dan.schmidt.va...@gmail.com> wrote:

>
> I am having a very similar problem. Just installed the 2.0.3 version
> and now all my serialisations complain.
>
> libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
> string containing invalid UTF-8 data while parsing protocol buffer.
> Strings must contain only UTF-8; use the 'bytes' type for raw bytes.
>
> Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
> string is an array of those. So, what does it need? Would I be better
> off serialising to a stream like the CodedStream?
>
> I am very confused on the issue. I have the horrible feeling now that
> I'm losing efficiency because serialising to string might mean that
> I'm losing my raw data.
>
> Otherwise, the word ERROR in the output might be a bit too
> strong.
>
> If anybody can clarify, I'd be very grateful.
>
> Dan
>
> On May 10, 5:59 pm, Henner Zeller <h.zel...@acm.org> wrote:
> > On Sun, May 10, 2009 at 6:08 AM, edan <edan...@gmail.com> wrote:
> > > I have some fields that may contain non-UTF8 data.
> > > I understand that I just need to change their type from "string" to
> > > "bytes" and it should just work, transparently.
> >
> > Yes. They're the same on the wire.
> >
> > > I have a few fields that probably will only contain ASCII, i.e. legal
> > > UTF8, but I'm not 100% sure.
> > > I am tempted to just turn them all to "bytes".
> > > But this raises the question - what is the "string" type useful for, and
> > > why shouldn't I just always use "bytes" to be sure, all the time, and not
> > > bother with "string" at all?
> > > Does "string" add anything besides validation that only valid UTF8 is
> > > passing over the wire?  Is there really a big benefit to this behavior?
> > > Or is there some other advantage that I'll miss out on by changing all my
> > > "string"s to "bytes"?
> >
> > If you use the C++ API there is not much difference, since both types
> > are represented as std::string in the API. It makes a big difference
> > for the Java API (and Python?), which have a native type for Unicode
> > text. In Java, if you deal with a protocol buffer 'string' type, the
> > generated API will return a java.lang.String, while otherwise it will
> > return a ByteString. ByteString can hold arbitrary bytes, while the
> > native Java String can only hold text (here, decoded from UTF-8). So
> > while 'ByteString' is more flexible, 'String' is more convenient to deal
> > with within Java code because all string manipulation libraries can
> > handle it.
> >
> > So the benefit is a more convenient API in the generated Java code,
> > and documentation as well: if you use 'string' you emphasize that a
> > field only contains readable text, while 'bytes' might contain any
> > binary blob.
> >
> > -h
> >
>
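
On the "same on the wire" point quoted above: both types use the
length-delimited wire type (2), so a given payload serializes to identical
bytes whichever one is declared. The sketch below hand-rolls that encoding in
plain Python instead of calling the protobuf library; the function names and
the choice of field number 1 are assumptions made for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


LENGTH_DELIMITED = 2  # wire type shared by 'string' and 'bytes' fields


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode one field as key + length + payload (hypothetical helper)."""
    key = encode_varint((field_number << 3) | LENGTH_DELIMITED)
    return key + encode_varint(len(payload)) + payload


# Field 1 declared as 'string' (text encoded to UTF-8) vs. as 'bytes'.
as_string = encode_length_delimited(1, "hello".encode("utf-8"))
as_bytes = encode_length_delimited(1, b"hello")

print(as_string == as_bytes)
```

Nothing in the encoded field records which of the two types was declared, so
the raw data is not altered by the choice; only the generated API and the
UTF-8 validation differ.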
