Re: Language identifier plugin questions

Jérôme Charron Thu, 25 Aug 2005 14:05:51 -0700

Hi Tom,

> I've been using the language identifier plugin, which I think is very
> nice.


Thanks ;-)

> I have a few questions which I hope someone might be able to
> answer:

I will try to...

> 1. Why is the NGramProfile getSimilarity() method not called from
> LanguageIdentifier? 

It was used in Nutch-0.6.
While working on NUTCH-60 (Bad language identifier performances) for Nutch-0.7, I
made a lot of changes to find ways to improve performance and precision.
One of these modifications was to change the identify method.
You can find the benchmarks for these improvements on the wiki at
http://wiki.apache.org/nutch/LanguageIdentifierBenchs
However, even though the getSimilarity method is no longer used in the Nutch
code, I did not remove it because, for me, it is a must-have method.

> 2. The javadoc for the identify() method of LanguageIdentifier states
> that it returns null if the language is not recognised. However, the 
> implementation can never return null.

Oops. I didn't update the javadoc (I will correct this on the 0.7 branch).

> Has this ever worked?

Yes, in Nutch-0.6 (it is a kind of regression)

> I think
> being able to recognise a "no match" case is an important part of the
> API (and would be easy to implement using a threshold value, if the 
> NGramProfile getSimilarity() method were being used).

That is how it worked in Nutch-0.6.
There has already been a discussion of this point on this list...
The solution I planned to implement for Nutch-0.7.1 (or Nutch-0.8) is to
return an object (or an ordered array of objects) that gathers the
identified language and its score. The responsibility for applying a
threshold is then on the client side.
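
To make the idea concrete, here is a minimal sketch of what such a result
object and a client-side threshold could look like. Everything in it is an
assumption made for illustration: the names LanguageScore, rank and
bestOrNull are hypothetical, not existing Nutch classes or methods, and the
sketch assumes that a higher score means a better match.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LanguageScoreSketch {

    /** Hypothetical value object: one candidate language and its score. */
    static final class LanguageScore {
        final String lang;
        final double score;

        LanguageScore(String lang, double score) {
            this.lang = lang;
            this.score = score;
        }
    }

    /** Returns a copy ordered from best to worst (assumes higher score = better). */
    static List<LanguageScore> rank(List<LanguageScore> candidates) {
        List<LanguageScore> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((LanguageScore c) -> c.score).reversed());
        return sorted;
    }

    /** Client-side policy: null means "no match" under the caller's threshold. */
    static String bestOrNull(List<LanguageScore> ranked, double threshold) {
        if (ranked.isEmpty() || ranked.get(0).score < threshold) {
            return null;
        }
        return ranked.get(0).lang;
    }

    public static void main(String[] args) {
        List<LanguageScore> candidates = new ArrayList<>();
        candidates.add(new LanguageScore("en", 0.42)); // made-up scores
        candidates.add(new LanguageScore("fr", 0.87));

        List<LanguageScore> ranked = rank(candidates);
        System.out.println(bestOrNull(ranked, 0.5)); // lenient client
        System.out.println(bestOrNull(ranked, 0.9)); // strict client
    }
}
```

With this shape the identifier never has to decide what "not recognised"
means; each client picks the threshold that suits its own data.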

> 3. The identify(InputStream is) method of LanguageIdentifier (in SVN)
> assumes that the stream has a UTF-8 encoding, which will obviously 
> break for other encodings. Would it not be better to use a reader? So
> the signature would be:
> public String identify(Reader reader) throws IOException
> or add a charset argument:
> public String identify(InputStream is, String charsetName) throws 
> IOException 

Yes, there are some issues with this method (fortunately it is not used in Nutch).
There is the one you report here and a multi-byte char splitting problem
reported by Piotr.
(Fixing these problems is one of my priorities for release 0.7.1.)
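
As a rough sketch of the direction you suggest (not the actual Nutch code),
the text can be decoded through a Reader built with an explicit charset.
The JDK's InputStreamReader keeps incomplete byte sequences between reads,
so a multi-byte character that straddles a buffer boundary is not split.
The class name and the readAll helpers below are hypothetical and exist
only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetAwareReadSketch {

    /** Reads all characters from a Reader; the caller has already chosen the charset. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4]; // deliberately tiny to force many reads
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    /** Convenience overload taking a charset name instead of assuming UTF-8. */
    static String readAll(InputStream is, String charsetName) throws IOException {
        return readAll(new InputStreamReader(is, Charset.forName(charsetName)));
    }

    public static void main(String[] args) throws IOException {
        String original = "J\u00e9r\u00f4me"; // contains two non-ASCII characters

        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Each stream decodes correctly once its real charset is supplied.
        System.out.println(original.equals(readAll(new ByteArrayInputStream(latin1), "ISO-8859-1")));
        System.out.println(original.equals(readAll(new ByteArrayInputStream(utf8), "UTF-8")));
    }
}
```

An identify(Reader) signature would push the charset decision to the caller
in the same way, while an identify(InputStream, String charsetName) overload
could simply wrap the stream as the second helper does.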

Thanks for your comments.

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
