tika-user  

Re: Can not filter out doc containing Chinese chars

Li Leon
Sun, 06 Dec 2009 18:24:36 -0800

With
$ java -jar tika-app-0.5.jar --text "Chinese Char.doc"
I ended up with
"??"

In my case, adding the -eunicode option:
$ java -jar tika-app-0.5.jar -eunicode --text "Chinese Char.doc"
produced the correct result:
"在么"

All of the above happened in a Windows environment during debugging. I
inspected the output in the Visual Studio "Watch" window, which supports
displaying UTF-8 encoded text.

I just wonder why this happens.
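
One possible explanation, which I have not verified against the Tika source:
without an explicit encoding the text is written in the JVM's default charset,
and if that charset cannot represent Chinese characters, Java substitutes "?"
for each of them. The sketch below only illustrates that Java behaviour. It
does not call Tika, the class and method names are made up for the example,
and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode the text to bytes in the given charset, then decode those bytes
    // again. Characters the charset cannot map are replaced during encoding
    // (with '?' for US-ASCII), so they cannot be recovered afterwards.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The two characters from the test document, written as escapes so
        // the source file's own encoding does not matter.
        String text = "\u5728\u4e48";

        // Shows which charset the JVM would use when none is specified.
        System.out.println("default charset: " + Charset.defaultCharset());

        // A charset without Chinese characters loses the text.
        System.out.println("US-ASCII gives \"??\": "
                + roundTrip(text, StandardCharsets.US_ASCII).equals("??"));

        // Unicode encodings preserve it.
        System.out.println("UTF-8 preserves text: "
                + roundTrip(text, StandardCharsets.UTF_8).equals(text));
        System.out.println("UTF-16 preserves text: "
                + roundTrip(text, StandardCharsets.UTF_16).equals(text));
    }
}
```

If this guess is right, it would explain why naming a Unicode output encoding
with the -e option makes the characters survive.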


Thanks,

2009/12/4 Jukka Zitting <jukka.zitt...@gmail.com>

> Hi,
>
> 2009/12/4 Li Leon <leon800...@gmail.com>:
> > Out of interest, how did you get the output: programmatically or from
> > the command line? If from the command line, what command did you use?
>
>    $ java -jar tika-app-0.5.jar --text "Chinese Char.doc"
>
> BR,
>
> Jukka Zitting
>