Re: Language identifier plugin questions

Tom White Tue, 30 Aug 2005 14:20:10 -0700

On 8/25/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> 
> 
> > 1. Why is the NGramProfile getSimilarity() method not called from
> > LanguageIdentifier?
> 
> It was used in Nutch-0.6.
> Working on NUTCH-60 (Bad language identifier performances) for Nutch-0.7, I
> made a lot of changes in order to find ways to improve performance and
> precision. One of these modifications was to change the identify method.
> You can find the benchmarks for these improvements on the wiki at
> http://wiki.apache.org/nutch/LanguageIdentifierBenchs
> However, even though the getSimilarity method is no longer used in Nutch
> code, I did not remove it because, for me, it is a must-have method.



I agree it is important to have the NGramProfile.getSimilarity() method. 
However, I think it is also important that it is consistent with the scoring 
that LanguageIdentifier uses, even if LanguageIdentifier optimises the 
implementation. Looking at the code I see that the two scoring methods are 
very different:

1. For identical profiles, NGramProfile scores 0, LanguageIdentifier
scores 2.
2. For disjoint profiles (no common n-grams), NGramProfile scores 2,
LanguageIdentifier scores 0 (actually Float.MIN_VALUE).

Furthermore, there does not seem to be a simple way to transform one score 
to the other.

Also, I wondered where the algorithm for the NGramProfile.getSimilarity() 
method came from. The Cavnar and Trenkle paper
(http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz) uses the 
out-of-place measure, which seems to be different from the measure used here. 
There are other measures too (http://www.xs4all.nl/~ajwp/langident.pdf), but 
these look different as well!
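
For reference, here is a minimal sketch of the out-of-place measure as I read it in the Cavnar and Trenkle paper. It is not the Nutch code: the class and method names are made up for illustration, profiles are assumed to be plain lists of n-grams ordered from most to least frequent, and the penalty for an n-gram missing from the language profile is assumed to be the length of that profile.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the out-of-place measure; hypothetical, not part of Nutch. */
class OutOfPlace {

    /**
     * Both arguments list n-grams from most to least frequent.
     * Returns 0 for identical rankings; larger values mean less similar.
     */
    static int distance(List<String> document, List<String> language) {
        // Record the rank (position) of every n-gram in the language profile.
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < language.size(); i++) {
            rank.put(language.get(i), i);
        }
        // Assumption: an n-gram absent from the language profile costs the profile length.
        int maxPenalty = language.size();
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer r = rank.get(document.get(i));
            // Sum how far each document n-gram is "out of place" in the language profile.
            total += (r == null) ? maxPenalty : Math.abs(r - i);
        }
        return total;
    }
}
```

With this measure lower is better, so it behaves like a distance rather than a similarity, and it is not bounded by 2 the way the two scores above appear to be.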

> > 2. The javadoc for the identify() method of LanguageIdentifier states
> > that it returns null if the language is not recognised. However, the
> > implementation can never return null.
> 
> Oops. I didn't update the javadoc (I will correct this on the 0.7 branch)
> 
> > Has this ever worked?
> 
> Yes, in Nutch-0.6 (it is a kind of regression)
> 
> > I think
> > being able to recognise a "no match" case is an important part of the
> > API (and would be easy to implement using a threshold value, if the
> > NGramProfile getSimilarity() method were being used).
> 
> It was how it worked in Nutch-0.6
> There is already a discussion on this list on this point...
> The solution I planned to implement for Nutch-0.7.1 (or Nutch-0.8) is to
> return an object (or an ordered array of objects) that gathers the
> identified language and its score. Then the responsibility for applying a
> threshold is on the client side.


I think this would be a great improvement (if the scoring can be made to 
work well). It makes the API more Lucene-like, by providing a list of "hits" 
and letting the client decide relevance thresholds.
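
To make the idea concrete, here is a rough sketch of what such an API could look like. Everything in it is hypothetical (LanguageHit, LanguageHits, rank and best are names I made up, not anything in Nutch), and it assumes that a higher score means a better match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical result type: a language code paired with its score. */
class LanguageHit {
    final String language;
    final float score;

    LanguageHit(String language, float score) {
        this.language = language;
        this.score = score;
    }
}

/** Hypothetical helpers showing where the threshold decision would live. */
class LanguageHits {

    /** Turns raw per-language scores into hits ordered best first (assumes higher is better). */
    static List<LanguageHit> rank(Map<String, Float> scores) {
        List<LanguageHit> hits = new ArrayList<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            hits.add(new LanguageHit(e.getKey(), e.getValue()));
        }
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    /** Client-side threshold: the best language, or null when nothing scores high enough. */
    static String best(List<LanguageHit> hits, float threshold) {
        if (hits.isEmpty() || hits.get(0).score < threshold) {
            return null;
        }
        return hits.get(0).language;
    }
}
```

The "no match" case then falls out of the client's choice of threshold, instead of being hard-wired into the identifier.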

> > 3. The identify(InputStream is) method of LanguageIdentifier (in SVN)
> > assumes that the stream has a UTF-8 encoding, which will obviously
> > break for other encodings. Would it not be better to use a reader? So
> > the signature would be:
> > public String identify(Reader reader) throws IOException
> > or add a charset argument:
> > public String identify(InputStream is, String charsetName) throws
> > IOException
> 
> Yes, there are some issues with this method (hopefully not used in Nutch).
> There's the one you report here and a multi-byte char splitting problem
> reported by Piotr.
> (it's one of my priorities to fix these problems for release 0.7.1)


Excellent.
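
In case it is useful, here is a small sketch of the decoding side of the two signatures I suggested. It only reads the text and does no identification; StreamDecoding and readAll are hypothetical names, not Nutch code. The point is that an InputStreamReader does the byte-to-char decoding, so a multi-byte character that straddles a buffer boundary should not get split.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

/** Hypothetical helper showing Reader-based and charset-aware input handling. */
class StreamDecoding {

    /** Reader variant: the caller has already chosen the encoding. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[2048];
        int n;
        // read() returns whole chars, never partial byte sequences.
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    /** Charset variant: wraps the stream in a decoder for the named charset. */
    static String readAll(InputStream is, String charsetName) throws IOException {
        // Throws UnsupportedEncodingException (an IOException) for an unknown charset name.
        return readAll(new InputStreamReader(is, charsetName));
    }
}
```

An identify(Reader) method could consume characters the same way readAll(Reader) does, and identify(InputStream, String) could simply delegate to it.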

> Thanks for your comments.
> 
> Regards
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 
> 
Thanks,

Tom
