On Tue, Mar 9, 2010 at 12:27 AM, Franz Allan Valencia See <franz....@gmail.com> wrote:
> Actually, that String/CharSequence is being placed on a Java class
> generated by Protobuf.

Ah. Then I think the problem is simply that NUL characters are not allowed in UTF-8 text. So you need to find some other way to delimit your messages.

> Which debugger would you suggest? Pardon, I'm a noob on native libraries.

Depends on the platform and compiler.

> Thanks,
> --
> Franz Allan Valencia See | Java Software Engineer
> franz....@gmail.com
> LinkedIn: http://www.linkedin.com/in/franzsee
> Twitter: http://www.twitter.com/franz_see
>
> On Tue, Mar 9, 2010 at 3:13 PM, Kenton Varda <ken...@google.com> wrote:
>
>> Protocol Buffers are binary data, not text. You can't store them in
>> String (or CharSequence) objects because those are meant only for Unicode
>> text. If CMeCab tries to transfer protobuf messages as Strings then it is,
>> unfortunately, broken.
>>
>> If you want to figure out how you are hitting that log message, you can
>> run in a debugger and insert a breakpoint at the file and line number shown
>> in the message.
>>
>> On Mon, Mar 8, 2010 at 10:19 PM, Franz Allan Valencia See
>> <franz....@gmail.com> wrote:
>>
>>> Good day,
>>>
>>> I am working on a Java application which uses a 3rd party framework
>>> called CMeCab-Java. CMeCab-Java has two parts - the Java side & the Cpp
>>> side. One way to bridge the two which CMeCab-Java provides is via protobuf
>>> (an advantage of this approach over the other bridging approaches is that
>>> this is faster and easier to work with, since what you get are objects and
>>> not streams).
>>>
>>> What CMeCab-Java does specifically is it accepts an input String
>>> (CharSequence to be exact), tokenizes it using MeCab
>>> (http://mecab.sourceforge.net/), and gives the result back as a Java
>>> object.
>>>
>>> While playing with it, I found that there were too many round trips
>>> between Java & the Cpp side. So what I am trying to do is to minimize that
>>> (and hopefully improve performance). Specifically, instead of passing the n
>>> texts one by one, what I did was assemble these n texts into a single long
>>> text delimited by 0x00 (i.e. {'the', 'quick', 'brown', 'fox'} becomes 'the'
>>> + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and passed that to the
>>> Cpp side via protobuf.
>>>
>>> This works OK in a single-threaded application. However, once I multithread
>>> this request, I am getting the following error from protobuf (with gdb),
>>> which crashes the JVM:
>>>
>>> libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string
>>> containing invalid UTF-8 data while serializing protocol buffer. Strings
>>> must contain only UTF-8; use the 'bytes' type for raw bytes.
>>>
>>> But I am not sure why that is. Is there any flag I can turn on to see
>>> what this invalid UTF-8 data is and which string it was processing when I
>>> got that? (...Or is there any easier/better way for me to achieve the
>>> performance gains that I am looking for? :-) )
>>>
>>> Thanks,
>>> --
>>> Franz Allan Valencia See | Java Software Engineer
>>> franz....@gmail.com
>>> LinkedIn: http://www.linkedin.com/in/franzsee
>>> Twitter: http://www.twitter.com/franz_see
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups
>>> "Protocol Buffers" group.
>>> To post to this group, send email to proto...@googlegroups.com.
>>> To unsubscribe from this group, send email to
>>> protobuf+unsubscr...@googlegroups.com.
>>> For more options, visit this group at
>>> http://groups.google.com/group/protobuf?hl=en.
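
For what it's worth, here is one possible way to follow the "find some other way to delimit your messages" advice, as a rough sketch only. It assumes the batched texts can travel in a protobuf `bytes` field (the type the error message itself points to) rather than a `string` field, and it frames each text with a 4-byte length prefix instead of a 0x00 separator. The class and method names (`LengthPrefixedTexts`, `pack`, `unpack`) are hypothetical and are not part of CMeCab-Java or protobuf. If the .proto file can be changed, a `repeated string` field is probably simpler still.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CMeCab-Java or protobuf.
public class LengthPrefixedTexts {

    // Packs each text as [4-byte big-endian length][UTF-8 bytes].
    // No delimiter character is needed, so the texts may contain anything.
    public static byte[] pack(List<String> texts) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (String text : texts) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length); // length prefix
            out.write(utf8);           // payload
        }
        out.flush();
        return buffer.toByteArray();
    }

    // Reverses pack(): reads length-prefixed UTF-8 chunks until the input ends.
    public static List<String> unpack(byte[] packed) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
        List<String> texts = new ArrayList<>();
        while (in.available() > 0) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            texts.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        List<String> texts = Arrays.asList("the", "quick", "brown", "fox");
        byte[] packed = pack(texts);
        // The packed array is what would go into the assumed 'bytes' field.
        System.out.println(unpack(packed)); // prints [the, quick, brown, fox]
    }
}
```

The C++ side would have to decode the same framing (a big-endian 32-bit length followed by that many bytes), which is the main cost of this approach compared with a `repeated string` field.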