tika-user  

Re: Can not filter out doc containing Chinese chars

Jukka Zitting
Thu, 03 Dec 2009 23:48:46 -0800

Hi,

On Fri, Dec 4, 2009 at 4:04 AM, Li Leon <leon800...@gmail.com> wrote:
> I'm using the following command to filter out the attached doc which is in
> Chinese. The doc was filtered fine but only with gibberish output. Any
> ideas?

What's the exact output you see? With the -x option, Tika outputs
XHTML, which is UTF-8 encoded by default. Make sure your console or
other tooling supports UTF-8. You can also explicitly specify the
output encoding with the --encoding=... option.
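
To illustrate why a console that doesn't expect UTF-8 would show
gibberish, here is a minimal sketch in plain Java. It is not Tika code:
the class name EncodingDemo and the method misread are made up for this
example, and ISO-8859-1 merely stands in for whatever charset the
console actually uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode the text as UTF-8 (what Tika writes by default with -x), then
    // decode those bytes with the charset the console is assumed to use.
    static String misread(String text, Charset consoleCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, consoleCharset);
    }

    public static void main(String[] args) {
        String text = "\u5728\u4e48"; // the two Chinese characters from the document

        // Matching charset: the text survives the round trip unchanged.
        System.out.println(text.equals(misread(text, StandardCharsets.UTF_8)));

        // Mismatched charset (an assumption for illustration): each UTF-8
        // byte is turned into a separate, unrelated character.
        String garbled = misread(text, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());
    }
}
```

Each of the two characters takes three bytes in UTF-8, so a single-byte
charset turns them into six unrelated characters. A Windows console
typically uses a legacy code page rather than UTF-8 unless it has been
changed, so redirecting the output to a file and opening it in a
UTF-8-aware editor is one way to check what Tika really produced.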

> "type "chinese char.doc" | java -jar "tika-app-0.4.jar" -x"

With Tika 0.5 I get the following correct output: 在么

BR,

Jukka Zitting