Re: [protobuf] Debugging invalid UTF-8 data

Franz Allan Valencia See Tue, 09 Mar 2010 00:27:33 -0800

Actually, that String/CharSequence is being set on a Java class generated
by Protobuf.


Which debugger would you suggest? Pardon, I'm a noob on native libraries.

Thanks,

-- 
Franz Allan Valencia See | Java Software Engineer
franz....@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Tue, Mar 9, 2010 at 3:13 PM, Kenton Varda <ken...@google.com> wrote:

> Protocol Buffers are binary data, not text.  You can't store them in String
> (or CharSequence) objects because those are meant only for Unicode text.  If
> CMeCab tries to transfer protobuf messages as Strings then it is,
> unfortunately, broken.
>
> If you want to figure out how you are hitting that log message, you can run
> in a debugger and insert a breakpoint at the file and line number shown in
> the message.
>
> On Mon, Mar 8, 2010 at 10:19 PM, Franz Allan Valencia See <
> franz....@gmail.com> wrote:
>
>> Good day,
>>
>> I am working on a Java application which uses a 3rd party framework called
>> CMeCab-Java. CMeCab-Java has two parts - the Java side & the Cpp side. One
>> way to bridge the two which CMeCab-Java provides is via protobuf (an
>> advantage of this approach over the other bridging approaches is that it
>> is faster and easier to work with, since what you get are objects and not
>> streams).
>>
>> What CMeCab-Java does specifically is it accepts an input String
>> (CharSequence to be exact) and tokenizes it using MeCab (
>> http://mecab.sourceforge.net/), and gives the result back as a Java
>> object.
>>
>> While playing with it, I found that there were too many round trips
>> between Java & the Cpp side. So what I am trying to do is to minimize that
>> (and hopefully improve performance). Specifically, instead of passing the n
>> texts individually, I assemble these n texts into a single long
>> text delimited by 0x00 (i.e. {'the', 'quick', 'brown', 'fox' } becomes 'the'
>> + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and pass that to the Cpp
>> side via protobuf.
>>
>> This works OK in a single-threaded application. However, once I multithread
>> this request, I get the following error from protobuf (seen with gdb),
>> which crashes the JVM:
>> libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string
>> containing invalid UTF-8 data while serializing protocol buffer. Strings
>> must contain only UTF-8; use the 'bytes' type for raw bytes.
>>
>> But I am not sure why that is. Is there any flag I can turn on to see what
>> this invalid UTF-8 data is and which string it was processing when I got
>> that? (...Or is there an easier/better way for me to achieve the
>> performance gains that I am looking for? :-) )
>>
>> Thanks,
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Protocol Buffers" group.
>> To post to this group, send email to proto...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> protobuf+unsubscr...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/protobuf?hl=en.
>>
>
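
The batching idea from the original message (joining n texts with 0x00 and sending them in one call) could be sketched roughly as below. This is only an illustration: the class and method names are hypothetical and are not part of CMeCab-Java or protobuf, and it assumes none of the input texts already contains U+0000.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CMeCab-Java or protobuf.
class NulDelimitedBatch {
    // Delimiter taken from the original message (0x00).
    static final char DELIMITER = '\u0000';

    // Joins the texts into one string, rejecting inputs that already
    // contain the delimiter (they could not be split back unambiguously).
    static String join(List<String> texts) {
        for (String text : texts) {
            if (text.indexOf(DELIMITER) >= 0) {
                throw new IllegalArgumentException("text contains the delimiter");
            }
        }
        return String.join(String.valueOf(DELIMITER), texts);
    }

    // Splits a batch back into its parts, keeping empty parts.
    static List<String> split(String batch) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = batch.indexOf(DELIMITER, start)) >= 0) {
            parts.add(batch.substring(start, idx));
            start = idx + 1;
        }
        parts.add(batch.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("the", "quick", "brown", "fox");
        String batch = join(texts);
        System.out.println(split(batch));
    }
}
```

One thing to keep in mind with this delimiter: C-style string functions on the native side treat 0x00 as a terminator, so the receiving code has to work with an explicit length rather than a NUL-terminated pointer.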
>
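
On the question of seeing what the invalid UTF-8 data is: the error in this thread is raised by the C++ library, but as a rough illustration of what "invalid UTF-8" means, the sketch below uses the standard Java CharsetDecoder to report the offset of the first malformed byte sequence in a buffer. The class and method names are hypothetical; this is not a protobuf facility, only a way to inspect bytes you have captured yourself.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of protobuf.
class Utf8Check {
    // Returns the byte offset of the first invalid UTF-8 sequence,
    // or -1 if the whole array is valid UTF-8.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        // On error the input position is left at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        byte[] good = "fox".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', 'b', (byte) 0xFF};
        System.out.println(firstInvalidOffset(good));
        System.out.println(firstInvalidOffset(bad));
    }
}
```

If the bytes are valid when they leave Java but the C++ side still complains only under multithreading, that would suggest looking at buffers shared between threads on the native side rather than at the input text itself.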

