> I found it and committed the fix. It was not using UTF-8 encoding
> sometimes.

Thanks Piotr.

> But while looking at the code I feel a little bit worried about
> LanguageIdentifier.identify(InputStream is) - as it reads bytes from
> the file in chunks and converts each chunk to a string separately. If a
> multibyte UTF-8 character is located at the chunk boundary it would be
> split in two parts.
> Am I right?

Yes Piotr, you are right. It's a very good analysis.
Who said code review isn't useful? ;-)
Hopefully, this method is not used in Nutch internals.
I will provide a correction as soon as possible.
Does anyone know a typical pattern for this?

Thanks again Piotr.

Regards
Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
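
On the chunk-boundary question raised above, one commonly used pattern is to let a java.io.InputStreamReader do the decoding instead of converting each byte chunk to a String separately. The reader keeps the decoder state between reads, so a multibyte UTF-8 sequence that straddles two reads is still decoded as one character. The following is only a minimal sketch of that idea, not the actual Nutch correction: the class name Utf8StreamDecoding and the method readAll are hypothetical, and it assumes the input really is UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Nutch).
final class Utf8StreamDecoding {

    private Utf8StreamDecoding() {
    }

    /**
     * Reads the whole stream as UTF-8 text.
     * Decoding is done by the reader, which buffers incomplete byte
     * sequences internally, so no character is split at a chunk boundary.
     *
     * @param in          the byte stream to decode (assumed to be UTF-8)
     * @param bufferChars size of the char buffer used for each read
     */
    static String readAll(InputStream in, int bufferChars) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[bufferChars];
        // Wrap the byte stream once; do not call new String(bytes) per chunk.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }
}
```

A caller that only needs the beginning of the text for language identification could stop the loop after a fixed number of characters instead of reading to the end of the stream.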
