Hello Jérôme,
I found it and committed the fix. It was not always using UTF-8 encoding.
But while looking at the code I became a little worried about
LanguageIdentifier.identify(InputStream is) - it reads bytes from the file in chunks and converts each chunk to a String separately. If a multibyte UTF-8 character is located at a chunk boundary, it would be split into two parts and decoded incorrectly.
Am I right?
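
To illustrate the concern, here is a minimal sketch. It is not the actual LanguageIdentifier code: the class name ChunkDecodeDemo and both method names are made up for this example, and it uses java.nio.charset.StandardCharsets, which needs a newer JDK than 1.4/1.5. The first method decodes each chunk separately, as described above; the second wraps the stream in an InputStreamReader, which keeps an incomplete byte sequence until the rest of it arrives.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ChunkDecodeDemo {

    // Decodes every chunk on its own. A multibyte character that straddles
    // a chunk boundary is seen as malformed input in both chunks.
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder out = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Lets InputStreamReader do the decoding. The reader carries incomplete
    // byte sequences over to the next read, so chunking happens on chars.
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder out = new StringBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buf = new char[chunkSize];
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Jérôme": 'J' is one byte and 'é' is two bytes in UTF-8,
        // so a chunk size of 2 splits 'é' across two chunks.
        String text = "J\u00e9r\u00f4me";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        String perChunk = decodePerChunk(new ByteArrayInputStream(bytes), 2);
        String viaReader = decodeWithReader(new ByteArrayInputStream(bytes), 2);

        System.out.println("per-chunk decoding intact: " + text.equals(perChunk));
        System.out.println("reader decoding intact:    " + text.equals(viaReader));
    }
}
```

With a chunk size of 2, the per-chunk result no longer equals the original text, while the reader-based result does. If identify() really decodes chunk by chunk, switching to a Reader would be one possible fix.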

Regards
Piotr


Jérôme Charron wrote:
It works on my Linux box - with both JDK 1.4 and 1.5.


OK, so it seems to be consistent with my conf.


I will try to track it down.


I assume it is an encoding problem with the Ngram profile files, but I have no time this evening.
Regards

Jérôme


