Re: [protobuf] Debugging invalid UTF-8 data

Kenton Varda Tue, 09 Mar 2010 11:28:30 -0800

On Tue, Mar 9, 2010 at 12:27 AM, Franz Allan Valencia See <
[email protected]> wrote:


> Actually, that String/CharSequence is being placed on a java Class
> generated by Protobuf.
>

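Ah.  Then I suspect the problem is the NUL characters.  U+0000 is technically
valid UTF-8, but it tends to break at a Java/C++ boundary: C-style strings
treat it as a terminator, and JNI's modified UTF-8 encodes it as the bytes
0xC0 0x80, which strict UTF-8 validation rejects.  So you need to find some
other way to delimit your messages.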
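
As a rough illustration of "some other way to delimit", here is a minimal
Java sketch.  It assumes that U+001F (the ASCII unit separator) never occurs
in your input texts and is acceptable to the C++ side and to MeCab, neither
of which I have verified.  The class and method names are hypothetical and
are not part of protobuf or CMeCab-Java.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimitedTexts {
    // Assumed delimiter: U+001F (unit separator). It is an ordinary one-byte
    // UTF-8 character, but whether the native side tolerates it is unverified.
    static final char DELIMITER = '\u001F';

    // Joins the texts into one string, refusing input that already contains
    // the delimiter so that split() can always recover the original pieces.
    static String join(List<String> texts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.size(); i++) {
            String text = texts.get(i);
            if (text.indexOf(DELIMITER) >= 0) {
                throw new IllegalArgumentException(
                    "text " + i + " already contains the delimiter");
            }
            if (i > 0) {
                sb.append(DELIMITER);
            }
            sb.append(text);
        }
        return sb.toString();
    }

    // Splits a joined string back into its pieces, keeping empty pieces.
    static List<String> split(String joined) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < joined.length(); i++) {
            if (joined.charAt(i) == DELIMITER) {
                parts.add(joined.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(joined.substring(start));
        return parts;
    }
}
```

If the message definition can be changed, a repeated string field would
avoid in-band delimiters altogether; the sketch above only covers the case
where everything has to travel in a single string.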


> Which debugger would you suggest? Pardon, I'm a noob on native libraries.


Depends on the platform and compiler.
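
Independent of the native debugger, it may be possible to narrow things down
from the Java side.  The sketch below is only an illustration with
hypothetical names (it is not part of protobuf or CMeCab-Java): it runs the
JDK's strict UTF-8 decoder over a byte array, for example the bytes about to
be handed to the native side, and reports the offset of the first malformed
sequence.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns the byte offset of the first malformed UTF-8 sequence in data,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(data.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On error, the input position is left at the offending bytes.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        // 0xC0 0x80 is how modified UTF-8 writes NUL; strict UTF-8 rejects it.
        byte[] suspicious = {'t', 'h', 'e', (byte) 0xC0, (byte) 0x80, 'f', 'o', 'x'};
        System.out.println("first invalid offset: " + firstInvalidOffset(suspicious));
    }
}
```

Logging the offending string and offset on the Java side, per thread, would
also show whether the bad bytes only appear under concurrent use.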


>
>
> Thanks,
>
> --
> Franz Allan Valencia See | Java Software Engineer
> [email protected]
> LinkedIn: http://www.linkedin.com/in/franzsee
> Twitter: http://www.twitter.com/franz_see
>
> On Tue, Mar 9, 2010 at 3:13 PM, Kenton Varda <[email protected]> wrote:
>
>> Protocol Buffers are binary data, not text.  You can't store them in
>> String (or CharSequence) objects because those are meant only for Unicode
>> text.  If CMeCab tries to transfer protobuf messages as Strings then it is,
>> unfortunately, broken.
>>
>> If you want to figure out how you are hitting that log message, you can
>> run in a debugger and insert a breakpoint at the file and line number shown
>> in the message.
>>
>> On Mon, Mar 8, 2010 at 10:19 PM, Franz Allan Valencia See <
>> [email protected]> wrote:
>>
>>> Good day,
>>>
>>> I am working on a java application which uses a 3rd party framework
>>> called CMeCab-Java. CMeCab-Java has two parts - the Java side & the Cpp
>>> side. One way to bridge the two which CMeCab-Java provides is via protobuf
>>> (an advantage of this approach over the other bridging approaches is that
>>> this is faster and easier to work with since what you get are objects and
>>> not streams).
>>>
>>> What CMeCab-Java does specifically is it accepts an input String
>>> (CharSequence to be exact) and tokenizes it using MeCab (
>>> http://mecab.sourceforge.net/), and gives the result back as a Java
>>> object.
>>>
>>> While playing with it, I found that there were too many round trips
>>> between Java & the Cpp side. So what I am trying to do is to minimize them
>>> (and hopefully improve performance). Specifically, instead of passing the n
>>> texts one at a time, I assembled them into a single long text delimited
>>> by 0x00 (i.e. {'the', 'quick', 'brown', 'fox'} becomes 'the'
>>> + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and passed that to the
>>> Cpp side via protobuf.
>>>
>>> This works OK in a single-threaded application. However, once I multithread
>>> this request, I get the following error from protobuf (with gdb),
>>> which crashes the JVM:
>>> libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string
>>> containing invalid UTF-8 data while serializing protocol buffer. Strings
>>> must contain only UTF-8; use the 'bytes' type for raw bytes.
>>>
>>> But I am not sure why that is. Is there any flag I can turn on to see
>>> what this invalid UTF-8 data is and which string it was processing when I
>>> got that? (...Or is there an easier/better way for me to achieve the
>>> performance gains that I am looking for? :-) )
>>>
>>> Thanks,
>>> --
>>> Franz Allan Valencia See | Java Software Engineer
>>> [email protected]
>>> LinkedIn: http://www.linkedin.com/in/franzsee
>>> Twitter: http://www.twitter.com/franz_see
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "Protocol Buffers" group.
>>> To post to this group, send email to [email protected].
>>> To unsubscribe from this group, send email to
>>> [email protected].
>>> For more options, visit this group at
>>> http://groups.google.com/group/protobuf?hl=en.
>>>
>>
>>
>
>
