On 8/25/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
>
> > 1. Why is the NGramProfile getSimilarity() method not called from
> > LanguageIdentifier?
>
> It was used in Nutch-0.6.
> Working on NUTCH-60 (Bad language identifier performances) for Nutch-0.7,
> I made a lot of changes in order to find the way(s) to improve performance
> and precision. One of these modifications was to change the identify method.
> You can find the benchmarks of these improvements on the wiki at
> http://wiki.apache.org/nutch/LanguageIdentifierBenchs
> However, even if the getSimilarity method is no longer used in Nutch code,
> I didn't remove it because, for me, it is a must-have method.

I agree it is important to have the NGramProfile.getSimilarity() method.
However, I think it is also important that it is consistent with the scoring
that LanguageIdentifier uses, even if LanguageIdentifier optimises the
implementation. Looking at the code, I see that the two scoring methods are
very different:

1. For identical profiles, NGramProfile scores 0, LanguageIdentifier scores 2.
2. For disjoint profiles (no common n-grams), NGramProfile scores 2,
   LanguageIdentifier scores 0 (actually Float.MIN_VALUE).

Furthermore, there does not seem to be a simple way to transform one score
into the other.

Also, I wondered where the algorithm for the NGramProfile.getSimilarity()
method came from. The Cavnar and Trenkle paper
(http://www.novodynamics.com/trenkle/papers/sdair-94-bc.ps.gz) uses the
out-of-place measure, which seems to be different from the measure used here.
There are other measures too (http://www.xs4all.nl/~ajwp/langident.pdf), but
these look different as well!

> > 2. The javadoc for the identify() method of LanguageIdentifier states
> > that it returns null if the language is not recognised. However, the
> > implementation can never return null.
>
> Oops. I didn't update the javadoc (I will correct this on the 0.7 branch)
>
> > Has this ever worked?
>
> Yes, in Nutch-0.6 (it is a kind of regression)
>
> > I think
> > being able to recognise a "no match" case is an important part of the
> > API (and would be easy to implement using a threshold value, if the
> > NGramProfile getSimilarity() method were being used).
>
> It was how it worked in Nutch-0.6
> There is already a discussion on this list on this point...
> The solution I planned to implement for Nutch-0.7.1 (or Nutch-0.8) is to
> return an object (or an ordered array of objects) that gathers the
> identified language and its score. Then the responsibility of applying a
> threshold is on the client side.

I think this would be a great improvement (if the scoring can be made to work
well).
It makes the API more Lucene-like, by providing a list of "hits" and letting
the client decide relevance thresholds.

> > 3. The identify(InputStream is) method of LanguageIdentifier (in SVN)
> > assumes that the stream has a UTF-8 encoding, which will obviously
> > break for other encodings. Would it not be better to use a reader? So
> > the signature would be:
> > public String identify(Reader reader) throws IOException
> > or add a charset argument:
> > public String identify(InputStream is, String charsetName) throws
> > IOException
>
> Yes, there are some issues with this method (hopefully not used in nutch).
> There's the one you report here and a multi-byte char splitting problem
> reported by Piotr.
> (it's one of my priorities to fix these problems for release 0.7.1)

Excellent. Thanks for your comments.

> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

Thanks,
Tom
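
P.S. For comparison, here is a rough sketch of an out-of-place style
measure in the spirit of the Cavnar and Trenkle paper. It is only an
illustration and not the Nutch code: the class and method names are made up,
and both the penalty for an n-gram missing from the language profile (the
size of that profile) and the alphabetical tie-breaking are assumptions of
mine. It is a distance, so lower means more similar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of Nutch.
class OutOfPlaceSketch {

    // Builds a ranked profile: the topK most frequent character n-grams of
    // lengths minLen..maxLen, most frequent first. Ties are broken
    // alphabetically so that the ranking is deterministic (an assumption).
    static List<String> rankedProfile(String text, int minLen, int maxLen, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    // Sums, over the document's n-grams, how far each one is from its rank
    // in the language profile. An n-gram absent from the language profile
    // costs a fixed maximum penalty (assumed here: the language profile size).
    static int outOfPlace(List<String> document, List<String> language) {
        Map<String, Integer> languageRank = new HashMap<>();
        for (int i = 0; i < language.size(); i++) {
            languageRank.put(language.get(i), i);
        }
        int maxPenalty = language.size();
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = languageRank.get(document.get(i));
            distance += (rank == null) ? maxPenalty : Math.abs(rank - i);
        }
        return distance;
    }
}
```

With this sketch, identical profiles give a distance of 0, and two disjoint
profiles give the document profile size times the language profile size, so
a "no match" threshold could in principle be expressed relative to that
maximum.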
