Li Leon
Thu, 03 Dec 2009 23:55:57 -0800
Thanks for the reply. The problem was that the console application doesn't support UTF-8, as I just found out. Sending the output to one that does support UTF-8 corrected the problem.
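For anyone else who hits this, here is a minimal sketch of what I think was happening: the bytes Tika wrote were valid UTF-8, but the console decoded them with a different code page. The cp1252 code page and the function names below are assumptions picked for illustration only, not what my console actually used.

```python
# Sketch: UTF-8 bytes shown through the wrong code page look like gibberish.
# ASSUMPTION: cp1252 stands in for "a console that does not support UTF-8";
# the real console code page may be different.

def as_seen_by_console(utf8_bytes: bytes, console_encoding: str) -> str:
    """Decode the bytes the way a console using console_encoding would."""
    return utf8_bytes.decode(console_encoding, errors="replace")


def demo() -> tuple[str, str]:
    text = "在么"                      # the text extracted from the doc
    utf8_bytes = text.encode("utf-8")  # what a UTF-8 writer emits (6 bytes)
    garbled = as_seen_by_console(utf8_bytes, "cp1252")
    correct = as_seen_by_console(utf8_bytes, "utf-8")
    return garbled, correct


if __name__ == "__main__":
    garbled, correct = demo()
    print("wrong code page:", repr(garbled))
    print("UTF-8 console:  ", repr(correct))
```

The bytes themselves are never damaged; only the interpretation differs, which is why redirecting the same output to a UTF-8 aware viewer shows the right characters.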
Out of interest, how did you get the output? Programmatically or from the command line? If from the command line, what command did you use?

Thanks,

2009/12/4 Jukka Zitting <jukka.zitt...@gmail.com>
> Hi,
>
> On Fri, Dec 4, 2009 at 4:04 AM, Li Leon <leon800...@gmail.com> wrote:
> > I'm using the following command to filter out the attached doc which is in
> > Chinese. The doc was filtered fine but only with gibberish output. Any
> > ideas?
>
> What's the exact output you see? The -x option makes Tika output by
> default UTF-8 encoded XHTML. Make sure your console or other tooling
> supports UTF-8. You can also explicitly specify the output encoding
> with the --encoding=... option.
>
> > "type "chinese char.doc" | java -jar "tika-app-0.4.jar" -x"
>
> With Tika 0.5 I get the following correct output: 在么
>
> BR,
>
> Jukka Zitting