Re: [protobuf] Debugging invalid UTF-8 data

Kenton Varda Mon, 08 Mar 2010 23:14:27 -0800

Protocol Buffers are binary data, not text.  You can't store them in String
(or CharSequence) objects because those are meant only for Unicode text.  If
CMeCab tries to transfer protobuf messages as Strings then it is,
unfortunately, broken.
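
To illustrate why a String cannot carry serialized protobuf bytes, here is a
minimal Java sketch. The class and method names are hypothetical and are not
part of CMeCab-Java or protobuf. The three-byte payload is the wire encoding
of a varint field 1 with value 150, and the sketch assumes the bytes are
decoded as UTF-8, the way a text-oriented bridge might do it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryAsStringDemo {
    // Hypothetical helper: pushes raw bytes through a String and back,
    // as a text-oriented bridge might do.
    static byte[] roundTripThroughString(byte[] data) {
        // Malformed UTF-8 sequences are silently replaced with U+FFFD here.
        String asText = new String(data, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Wire bytes for "field 1 (varint) = 150": 0x08 0x96 0x01.
        // 0x96 on its own is not valid UTF-8.
        byte[] wire = {0x08, (byte) 0x96, 0x01};
        byte[] back = roundTripThroughString(wire);

        System.out.println("original:   " + Arrays.toString(wire));
        System.out.println("round trip: " + Arrays.toString(back));
        System.out.println("identical:  " + Arrays.equals(wire, back));
    }
}
```

The round trip does not return the original bytes, so the message can no
longer be parsed correctly. Passing the serialized message as a byte[] (or a
protobuf ByteString) avoids the problem.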


If you want to figure out how you are hitting that log message, you can run
in a debugger and insert a breakpoint at the file and line number shown in
the message.
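
As a complement to the breakpoint, any byte[] that is about to go into a
protobuf `string` field can be checked for valid UTF-8 on the Java side
first. This is only a sketch: `isValidUtf8` is a hypothetical helper, not a
protobuf or CMeCab-Java API, and it assumes you can get at the raw bytes
before they cross the bridge.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Hypothetical helper: returns true only if data is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Texts joined with 0x00, as in the original post; NUL is valid UTF-8.
        byte[] joined = "the\u0000quick\u0000brown\u0000fox".getBytes(StandardCharsets.UTF_8);
        // 0xFF never appears in well-formed UTF-8.
        byte[] broken = {'f', 'o', 'x', (byte) 0xFF};

        System.out.println("joined valid: " + isValidUtf8(joined));
        System.out.println("broken valid: " + isValidUtf8(broken));
    }
}
```

Logging the offending bytes when the check fails shows which input is being
corrupted, which is useful when the failure only appears with multiple
threads.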

On Mon, Mar 8, 2010 at 10:19 PM, Franz Allan Valencia See <
franz....@gmail.com> wrote:

> Good day,
>
> I am working on a Java application which uses a 3rd party framework called
> CMeCab-Java. CMeCab-Java has two parts - the Java side & the Cpp side. One
> way to bridge the two which CMeCab-Java provides is via protobuf (an
> advantage of this approach over the other bridging approaches is that it
> is faster and easier to work with since what you get are objects and not
> streams).
>
> What CMeCab-Java does specifically is it accepts an input String
> (CharSequence to be exact) and tokenizes it using MeCab (
> http://mecab.sourceforge.net/), and gives the result back as a Java
> object.
>
> While playing with it, I found that there were too many round trips between
> Java & the Cpp side. So what I am trying to do is to minimize that (and
> hopefully improve performance). Specifically, instead of passing the n
> texts separately, I assembled these n texts into a single long
> text delimited by 0x00 (i.e. {'the', 'quick', 'brown', 'fox' } becomes 'the'
> + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and passed that to the
> Cpp side via protobuf.
>
> This works OK in a single-threaded application. However, once I multithread
> this request, I get the following error from protobuf (seen with gdb),
> which crashes the JVM:
> libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string
> containing invalid UTF-8 data while serializing protocol buffer. Strings
> must contain only UTF-8; use the 'bytes' type for raw bytes.
>
> But I am not sure why that is. Is there any flag I can turn on to see what
> this invalid UTF-8 data is and which string it was processing when I got
> that? (...Or is there any easier/better way for me to achieve the
> performance gains that I am looking for? :-) )
>
> Thanks,
> --
> Franz Allan Valencia See | Java Software Engineer
> franz....@gmail.com
> LinkedIn: http://www.linkedin.com/in/franzsee
> Twitter: http://www.twitter.com/franz_see
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to proto...@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
>

