[ 
https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901980#action_12901980
 ] 

Jan Høydahl commented on TIKA-496:
----------------------------------

Well, Norway is not part of the EU, so that document probably doesn't exist. 
The Norwegian corpus is arguably smaller, judging by the .ngp, and there are 
no tests for "no" either.

> Language identifier profile comparison favors large profiles
> ------------------------------------------------------------
>
>                 Key: TIKA-496
>                 URL: https://issues.apache.org/jira/browse/TIKA-496
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>
> I think I've found a flaw in the distance algorithm.
> In the distance() method of LanguageProfile.java, we normalize the frequency 
> for an ngram by dividing by the total count.
> The total count for a profile is simply the sum of all counts in the profile.
> The problem is that the .ngp files are cut off at 1000 entries, and the total 
> count is then the sum of only those 1000 entries.
> However, there is a long tail of lower-frequency ngrams which are cut off 
> and therefore not included in the total count.
> The effect is that the ngrams from profiles with a large training set carry 
> more weight than ngrams from profiles with a smaller training set.
> You can see this effect especially well when classifying short texts in a 
> language which has similar sister languages with larger training sets. My 
> example is "no" vs "da".
> Sample from the tail of "no.ngp":
> _gå 461
> ask 461
> ria 459
> små 459
> ...and from the tail of "da.ngp":
> dbr 966
> ost 966
> ævn 964
> It is obvious that "da" has a longer tail after cutoff than "no" and 
> therefore a larger sum.
> A solution is to count the real total when generating the .ngp file and to 
> store that total in the profile file itself, instead of summing the counts 
> when loading the cut-off profile.
> Alternatively, normalize the counts before writing the .ngp file, so that the 
> top entry is always 100000.
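
To make the normalization issue in the quoted description concrete, here is a 
minimal Java sketch. It is a toy model, not Tika's LanguageProfile code: the 
class and method names (ProfileTotalSketch, Profile, truncatedFrequency, 
realFrequency) and all counts are made up for illustration. Both toy profiles 
give "en_" the same share of the full training data (0.5), but a different 
share of their counts fell below the cutoff.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a cut-off ngram profile; names and numbers are hypothetical. */
public class ProfileTotalSketch {

    public static final class Profile {
        // Entries that survived the cutoff (1000 in the real .ngp files).
        private final Map<String, Integer> kept = new HashMap<>();
        // Total counted over ALL ngrams before the cutoff was applied.
        private final long realTotal;

        public Profile(long realTotal) {
            this.realTotal = realTotal;
        }

        public Profile add(String ngram, int count) {
            kept.put(ngram, count);
            return this;
        }

        /** Total recomputed at load time from the kept entries only. */
        public long truncatedTotal() {
            long sum = 0;
            for (int count : kept.values()) {
                sum += count;
            }
            return sum;
        }

        /** Normalization by the truncated total, as the report describes. */
        public double truncatedFrequency(String ngram) {
            return (double) kept.getOrDefault(ngram, 0) / truncatedTotal();
        }

        /** Normalization by a total stored at generation time (proposed fix). */
        public double realFrequency(String ngram) {
            return (double) kept.getOrDefault(ngram, 0) / realTotal;
        }
    }

    /** Small training set: 600 counts in total, 200 of them below the cutoff. */
    public static Profile small() {
        return new Profile(600).add("en_", 300).add("er_", 100);
    }

    /** Large training set: 6000 counts in total, 2250 of them below the cutoff. */
    public static Profile large() {
        return new Profile(6000).add("en_", 3000).add("er_", 750);
    }

    public static void main(String[] args) {
        for (Profile p : new Profile[] { small(), large() }) {
            System.out.println("truncated=" + p.truncatedFrequency("en_")
                    + " real=" + p.realFrequency("en_"));
        }
    }
}
```

With the truncated total the two profiles report different frequencies for 
the same ngram (0.75 and 0.8), although its true share is 0.5 in both. With 
the stored total both report 0.5, which is what the first proposed fix aims 
for.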

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
