Actually, that String/CharSequence is being placed on a Java class generated by Protobuf.
Which debugger would you suggest? Pardon, I'm a noob with native libraries.

Thanks,
--
Franz Allan Valencia See | Java Software Engineer
franz....@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

On Tue, Mar 9, 2010 at 3:13 PM, Kenton Varda <ken...@google.com> wrote:

> Protocol Buffers are binary data, not text. You can't store them in String
> (or CharSequence) objects because those are meant only for Unicode text. If
> CMeCab tries to transfer protobuf messages as Strings then it is,
> unfortunately, broken.
>
> If you want to figure out how you are hitting that log message, you can run
> in a debugger and insert a breakpoint at the file and line number shown in
> the message.
>
> On Mon, Mar 8, 2010 at 10:19 PM, Franz Allan Valencia See <
> franz....@gmail.com> wrote:
>
>> Good day,
>>
>> I am working on a Java application which uses a 3rd party framework called
>> CMeCab-Java. CMeCab-Java has two parts: the Java side and the Cpp side. One
>> way to bridge the two which CMeCab-Java provides is via protobuf (an
>> advantage of this approach over the other bridging approaches is that it
>> is faster and easier to work with, since what you get are objects and not
>> streams).
>>
>> What CMeCab-Java does specifically is accept an input String
>> (CharSequence to be exact), tokenize it using MeCab (
>> http://mecab.sourceforge.net/), and give the result back as a Java
>> object.
>>
>> While playing with it, I found that there were too many round trips
>> between the Java and Cpp sides. So what I am trying to do is minimize those
>> (and hopefully improve performance). Specifically, instead of passing the n
>> texts one by one, I assemble these n texts into a single long
>> text delimited by 0x00 (i.e. {'the', 'quick', 'brown', 'fox'} becomes 'the'
>> + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and pass that to the Cpp
>> side via protobuf.
>>
>> This works OK in a single-threaded application.
>> However, once I multithread this request, I am getting the following error
>> from protobuf (with gdb), which crashes the JVM:
>> libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string
>> containing invalid UTF-8 data while serializing protocol buffer. Strings
>> must contain only UTF-8; use the 'bytes' type for raw bytes.
>>
>> But I am not sure why that is. Is there any flag I can turn on to see what
>> this invalid UTF-8 data is and which string it was processing when I got
>> that? (...Or is there any easier/better way for me to achieve the
>> performance gains that I am looking for? :-) )
>>
>> Thanks,
>> --
>> Franz Allan Valencia See | Java Software Engineer
>> franz....@gmail.com
>> LinkedIn: http://www.linkedin.com/in/franzsee
>> Twitter: http://www.twitter.com/franz_see
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Protocol Buffers" group.
>> To post to this group, send email to proto...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> protobuf+unsubscr...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/protobuf?hl=en.
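
For anyone reading the thread later, here is a minimal Java sketch of the two points discussed above: text joined with a 0x00 delimiter is still valid UTF-8, while arbitrary binary data (such as a serialized protobuf message) is generally not, and gets altered when it passes through a String. The class and method names (Utf8Check, isValidUtf8, roundTripThroughString, joinWithNul) are made up for illustration and are not part of protobuf or CMeCab-Java; the sample bytes 08 96 01 are just an example of a small binary message. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Check {

    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes bytes as UTF-8 and re-encodes them, as happens when binary data is kept in a String. */
    static byte[] roundTripThroughString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Joins texts with a NUL (0x00) delimiter, as described in the original message. */
    static String joinWithNul(String... texts) {
        return String.join("\u0000", texts);
    }

    public static void main(String[] args) {
        byte[] text = joinWithNul("the", "quick", "brown", "fox")
                .getBytes(StandardCharsets.UTF_8);
        // The NUL-delimited text is still well-formed UTF-8.
        System.out.println("joined text valid UTF-8: " + isValidUtf8(text));

        // Example binary payload; 0x96 is a stray continuation byte in UTF-8 terms.
        byte[] binary = {(byte) 0x08, (byte) 0x96, (byte) 0x01};
        System.out.println("binary valid UTF-8: " + isValidUtf8(binary));
        System.out.println("binary survives String round trip: "
                + Arrays.equals(binary, roundTripThroughString(binary)));
    }
}
```

A check like isValidUtf8 could be run on each String just before it is set on the generated message, which may help narrow down which input triggers the log message. It does not explain the multithreading aspect; that still needs a debugger on the native side.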