> I found it and committed the fix. It was not always using UTF-8
> encoding.

Thanks Piotr

> But while looking at the code I feel a little bit worried about
> LanguageIdentifier.identify(InputStream is), as it reads bytes from
> the file in chunks and converts each chunk to a string separately. If a
> multibyte UTF-8 character is located at a chunk boundary, it would be
> split in two parts.
> Am I right?

Yes Piotr, you are right. It's a very good analysis.
Who said code review isn't useful? ;-)
Hopefully, this method is not used in Nutch internals.
I will provide a correction as soon as possible.
Does anyone know a typical pattern for this?
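
For reference, one commonly used pattern is to let a java.io.InputStreamReader do the decoding, since it keeps the trailing bytes of an incomplete character until the rest arrives. The sketch below is only an illustration of that idea, not the actual Nutch code: the class ChunkedUtf8 and the methods readAll and readAllNaive are hypothetical names, and the naive variant is an assumption about what per-chunk conversion looks like.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch.
public class ChunkedUtf8 {

    // Decodes through a Reader: the decoder carries incomplete
    // multibyte sequences over to the next read.
    static String readAll(InputStream in, int chunkSize) {
        StringBuilder out = new StringBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buffer = new char[chunkSize];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Assumed shape of the problematic approach: each byte chunk is
    // converted to a String on its own, so a character split across
    // two chunks is decoded as replacement characters.
    static String readAllNaive(InputStream in, int chunkSize) {
        StringBuilder out = new StringBuilder();
        byte[] buffer = new byte[chunkSize];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.append(new String(buffer, 0, n, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "J\u00e9r\u00f4me"; // two 2-byte UTF-8 characters
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 2 puts the first byte of 'é' at a chunk boundary.
        System.out.println(readAll(new ByteArrayInputStream(data), 2).equals(text));
        System.out.println(readAllNaive(new ByteArrayInputStream(data), 2).equals(text));
    }
}
```

With a chunk size of 2, the first chunk ends in the middle of the second character, so the Reader-based version should round-trip the text while the per-chunk version should not.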

Thanks again Piotr.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
