Hello Jérôme,
I found it and committed the fix. In some cases it was not using UTF-8 encoding.
But while looking at the code I became a little worried about
LanguageIdentifier.identify(InputStream is), as it reads bytes from the
file in chunks and converts each chunk to a string separately. If a multibyte
UTF-8 character is located at a chunk boundary, it would be split
in two parts.
Am I right?
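To illustrate the concern, here is a minimal sketch. It is not the actual LanguageIdentifier code: the class ChunkBoundaryDemo and both method names are made up for this example, and it uses StandardCharsets, so it assumes Java 7 or later. The first method decodes each byte chunk on its own, which is the pattern I am worried about; the second lets an InputStreamReader do the decoding, so a character that straddles two reads is still decoded as a whole.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class ChunkBoundaryDemo {

    // Suspect pattern: every byte chunk is converted to a String separately.
    // A multibyte character cut by a chunk boundary becomes replacement characters.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder out = new StringBuilder();
        byte[] buffer = new byte[chunkSize];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.append(new String(buffer, 0, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Safer pattern: the reader's decoder keeps incomplete byte sequences
    // between reads, so we only ever see whole decoded chars.
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder out = new StringBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buffer = new char[chunkSize];
        int len;
        while ((len = reader.read(buffer)) != -1) {
            out.append(buffer, 0, len);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "J\u00e9r\u00f4me"; // "Jérôme", written with escapes
        byte[] data = text.getBytes(StandardCharsets.UTF_8);

        // With 2-byte chunks the two bytes of 'é' land in different chunks.
        String perChunk = decodePerChunk(new ByteArrayInputStream(data), 2);
        String viaReader = decodeWithReader(new ByteArrayInputStream(data), 2);

        System.out.println("per chunk matches original: " + perChunk.equals(text));
        System.out.println("via reader matches original: " + viaReader.equals(text));
    }
}
```

If identify() really does decode chunk by chunk, wrapping the stream in a reader like this would probably be the simplest fix, but I have not checked whether that fits the rest of the code.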
Regards
Piotr
Jérôme Charron wrote:
It works on my Linux box - with both JDK 1.4 and 1.5.
OK, so it seems to be consistent with my configuration.
I will try to track it down.
I assume it is an encoding problem in the Ngram profile files, but I have no
time this evening.
Regards
Jérôme