tika-user  

Re: Can not filter out doc containing Chinese chars

Li Leon
Thu, 03 Dec 2009 23:55:57 -0800

Thanks for the reply.

The problem was that the console application doesn't support UTF-8, as I just
found out. Sending the output to one that does support UTF-8 fixed the problem.
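
For anyone who hits the same symptom, here is a minimal Java sketch of what I
believe was happening. It is not taken from Tika: the class name
Utf8ConsoleDemo and the method renderAs are made up for illustration, and
ISO-8859-1 is only an assumed stand-in for whatever non-UTF-8 code page a
console might use. It encodes the two characters as UTF-8 and then decodes the
same bytes with two different charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ConsoleDemo {

    // Encode the text as UTF-8 (the default for Tika's -x output, per the
    // quoted reply) and decode the bytes with the charset a console assumes.
    static String renderAs(String text, Charset consoleCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, consoleCharset);
    }

    public static void main(String[] args) {
        // The two characters from the sample output, written as escapes
        // so the source file's own encoding does not matter.
        String text = "\u5728\u4e48";

        String good = renderAs(text, StandardCharsets.UTF_8);
        // Assumption: ISO-8859-1 stands in for a non-UTF-8 console code page.
        String bad = renderAs(text, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 console round-trips: " + good.equals(text));
        System.out.println("Legacy console round-trips: " + bad.equals(text));
        System.out.println("Characters shown by legacy console: " + bad.length());
    }
}
```

The bytes are identical in both cases; only the charset used to interpret them
differs, which is why switching to a UTF-8 capable console was enough.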

Out of interest, how did you get the output: programmatically or from the
command line? If from the command line, what command did you use?


Thanks,

2009/12/4 Jukka Zitting <jukka.zitt...@gmail.com>

> Hi,
>
> On Fri, Dec 4, 2009 at 4:04 AM, Li Leon <leon800...@gmail.com> wrote:
> > I'm using the following command to filter out the attached doc which is
> > in Chinese. The doc was filtered fine but only with gibberish output. Any
> > ideas?
>
> What's the exact output you see? By default, the -x option makes Tika
> output UTF-8 encoded XHTML. Make sure your console or other tooling
> supports UTF-8. You can also explicitly specify the output encoding
> with the --encoding=... option.
>
> > "type "chinese char.doc" | java -jar "tika-app-0.4.jar" -x"
>
> With Tika 0.5 I get the following correct output: 在么
>
> BR,
>
> Jukka Zitting
>