Re: Language identifier plugin questions

Jérôme Charron Wed, 31 Aug 2005 02:23:06 -0700

> 
> I agree it is important to have the NGramProfile.getSimilarity() method.
> However, I think it is also important that it is consistent with the 
> scoring
> that LanguageIdentifier uses, even if LanguageIdentifier optimises the
> implementation. Looking at the code I see that the two scoring methods are
> very different:


I don't think they are so different.
I haven't checked the consistency between the two scores, but I think one is 
simply the inverse function of the other.
It's true that, since the LanguageIdentifier score is not visible from 
client code, the consistency between the two scores is a minor problem for now. 
But once both scores are available from the API, consistency will be 
an important point.
Could you please create a JIRA issue about this so that we keep it in 
mind for nutch-0.8?

> Also, I wondered where the algorithm for the NGramProfile.getSimilarity()
> method came from. The Cavnar and Trenkle paper (
> http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz) uses the
> out-of-place measure, which seems to be different to the measure used 
> here.
> There are other measures too (http://www.xs4all.nl/~ajwp/langident.pdf), 
> but
> these look different as well!

There's a lot of literature about ngrams and language identification, and 
many ways of building ngrams:
* What size of ngrams to use?
* Keeping special characters or not?
* ...
So there are many ways of computing scores, depending on many criteria.
But I will take a look at those papers.
Concerning the NGramProfile.getSimilarity() method, I don't know where 
it came from.
This method was originally coded by Sami Siren.
Sami, could you please tell us where your inspiration came from?
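
For comparison, here is a rough sketch of the out-of-place measure as I understand it from the Cavnar and Trenkle paper. It is not the NGramProfile.getSimilarity() code. The class and method names are hypothetical, and the penalty for an ngram missing from the language profile is assumed to be the size of that profile. Note that it is a distance: lower means more similar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: out-of-place distance between ranked ngram profiles. */
final class OutOfPlaceSketch {

    /** Counts character ngrams of length 1..maxLen and returns them ordered by descending frequency. */
    static List<String> rankedProfile(String text, int maxLen, int maxSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically so the ranking is deterministic.
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        // Keep only the top maxSize ngrams.
        return grams.size() > maxSize ? new ArrayList<>(grams.subList(0, maxSize)) : grams;
    }

    /** Sums, for each document ngram, how far its rank is from its rank in the language profile. */
    static int distance(List<String> document, List<String> language) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < language.size(); i++) {
            rank.put(language.get(i), i);
        }
        int maxPenalty = language.size(); // assumed penalty for an ngram absent from the language profile
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer r = rank.get(document.get(i));
            total += (r == null) ? maxPenalty : Math.abs(r - i);
        }
        return total;
    }
}
```

The document would then be assigned the language whose profile gives the smallest distance.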

> I think this would be a great improvement (if the scoring can be made to
> work well). It makes the API more Lucene-like, by providing a list of 
> "hits"
> and letting the client decide relevance thresholds.

OK for me (to be added in a JIRA issue for next release).
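
To make the idea concrete, a minimal sketch of what such a "hits" API could look like. Nothing here exists in Nutch today: the class, the Hit type and the method names are all hypothetical, and I assume a score where higher means more similar.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not an existing Nutch API: ranked language "hits". */
final class LanguageHitsSketch {

    /** One candidate language with its score (higher is assumed to mean more similar). */
    static final class Hit {
        final String language;
        final double score;

        Hit(String language, double score) {
            this.language = language;
            this.score = score;
        }
    }

    /** Returns every candidate, best score first, without applying any cut-off. */
    static List<Hit> rank(Map<String, Double> scoresByLanguage) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, Double> e : scoresByLanguage.entrySet()) {
            hits.add(new Hit(e.getKey(), e.getValue()));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    /** Client-side filtering: keep only the hits at or above the caller's threshold. */
    static List<Hit> atLeast(List<Hit> hits, double threshold) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (h.score >= threshold) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```

The point is that the identifier only ranks, and the threshold stays in the client code.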

> > Yes, there are some issues with this method (fortunately not used in nutch).
> > There's the one you report here and a multi-byte char splitting problem
> > reported by Piotr.
> > (it's one of my priorities to fix these problems for release 0.7.1)

> Excellent.

Committed in the nutch-0.7 branch.
Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
