Jukka Zitting
Thu, 03 Dec 2009 23:48:46 -0800
Hi,

On Fri, Dec 4, 2009 at 4:04 AM, Li Leon <leon800...@gmail.com> wrote:
> I'm using the following command to filter out the attached doc which is in
> Chinese. The doc was filtered fine but only with gibberish output. Any
> ideas?

What's the exact output you see? The -x option makes Tika output UTF-8
encoded XHTML by default. Make sure your console or other tooling supports
UTF-8. You can also explicitly specify the output encoding with the
--encoding=... option.

> "type "chinese char.doc" | java -jar "tika-app-0.4.jar" -x"

With Tika 0.5 I get the following correct output:

    在么

BR,

Jukka Zitting
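
P.S. To illustrate why a console that does not expect UTF-8 shows gibberish, here is a minimal sketch using only the JDK (it does not touch the Tika API). The class and method names are hypothetical, and ISO-8859-1 is just an assumed stand-in for whatever non-UTF-8 encoding your console might be using.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encode the text as UTF-8 (what the -x option emits) and decode the
    // bytes with the charset the console is assumed to use.
    static String asSeenBy(String text, Charset consoleCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, consoleCharset);
    }

    public static void main(String[] args) throws Exception {
        String text = "\u5728\u4e48"; // the two characters from the sample doc

        // Write through an explicitly UTF-8 stream so this demo does not
        // depend on the platform default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");

        // Matching charset: the text round-trips unchanged.
        out.println("UTF-8 console:      " + asSeenBy(text, StandardCharsets.UTF_8));

        // Mismatched charset (assumed): every UTF-8 byte becomes its own
        // character, which is the kind of gibberish described above.
        out.println("ISO-8859-1 console: " + asSeenBy(text, StandardCharsets.ISO_8859_1));
    }
}
```

If the first line also looks wrong when you run this, the console itself is not interpreting UTF-8, which points at the tooling rather than at Tika.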