[ 
https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Høydahl closed TIKA-496.
----------------------------
    Resolution: Won't Do

Closing, as this is only a problem for the original Tika langid, which has 
been superseded by better options.

> Language identifier profile comparison favors large profiles
> ------------------------------------------------------------
>
>                 Key: TIKA-496
>                 URL: https://issues.apache.org/jira/browse/TIKA-496
>             Project: Tika
>          Issue Type: Bug
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Priority: Major
>
> I think I've found a flaw in the distance algorithm.
> In the distance() method of LanguageProfile.java, we normalize the frequency 
> of an ngram by dividing by the total count.
> The total count for a profile is simply the sum of all counts in the profile.
> The problem is that the .ngp files are cut off at 1000 entries, and the total 
> count is then the sum of only those 1000 entries.
> However, there is a long tail of lower-frequency ngrams which are cut off 
> and therefore not included in the total count.
> The effect is that ngrams from profiles with a large training set carry more 
> weight than ngrams from profiles with a smaller training set.
> You can see this effect especially well when classifying short texts in a 
> language which has similar sister languages with larger training sets. My 
> example is "no" vs "da".
> Sample from the tail of "no.ngp":
> _gå 461
> ask 461
> ria 459
> små 459
> ...and from the tail of "dk.ngp":
> dbr 966
> ost 966
> ævn 964
> It is obvious that "dk" has a longer tail after the cutoff than "no" and 
> therefore a larger sum.
> A solution is to compute the real total count when generating the .ngp file 
> and store the total in the profile file itself, instead of counting when 
> loading the cut-off profile.
> Alternatively, normalize the counts before writing the .ngp file, so that the 
> top entry is always 100000.
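
To make the report concrete, here is a minimal Java sketch of the effect and of 
the two proposed remedies. It is not Tika's code: the class NgramProfileSketch, 
its methods, and all counts in main() are hypothetical and chosen only for 
illustration. It assumes that the same ngram has the same true relative 
frequency (0.5) in a small and a large training set, and that the large set 
loses more mass to the cutoff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; this is not Tika's LanguageProfile class.
class NgramProfileSketch {
    private final Map<String, Long> counts;
    private final long total; // denominator used when normalizing counts

    private NgramProfileSketch(Map<String, Long> retained, long total) {
        this.counts = new LinkedHashMap<>(retained);
        this.total = total;
    }

    private static long sum(Map<String, Long> retained) {
        long s = 0;
        for (long c : retained.values()) {
            s += c;
        }
        return s;
    }

    // Behaviour described in the report: the total is recomputed at load
    // time from the entries that survived the cutoff.
    static NgramProfileSketch withTruncatedTotal(Map<String, Long> retained) {
        return new NgramProfileSketch(retained, sum(retained));
    }

    // First proposed fix: the real total is recorded when the profile is
    // generated and stored alongside the retained entries.
    static NgramProfileSketch withStoredTotal(Map<String, Long> retained, long realTotal) {
        if (realTotal < sum(retained)) {
            throw new IllegalArgumentException("real total is smaller than the retained counts");
        }
        return new NgramProfileSketch(retained, realTotal);
    }

    // Second proposed alternative: rescale counts so the top entry equals
    // a fixed value (the report suggests 100000). Uses integer division.
    static Map<String, Long> scaleToTop(Map<String, Long> retained, long top) {
        long max = 0;
        for (long c : retained.values()) {
            max = Math.max(max, c);
        }
        if (max == 0) {
            throw new IllegalArgumentException("profile has no positive counts");
        }
        Map<String, Long> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : retained.entrySet()) {
            scaled.put(e.getKey(), e.getValue() * top / max);
        }
        return scaled;
    }

    // Relative frequency of an ngram under this profile's denominator.
    double frequency(String ngram) {
        return counts.getOrDefault(ngram, 0L) / (double) total;
    }

    public static void main(String[] args) {
        // Made-up counts: "ask" is half of all ngrams in both training sets.
        Map<String, Long> small = new LinkedHashMap<>();
        small.put("ask", 100L);
        small.put("ost", 90L);      // real total 200, so 10 counts were cut off
        Map<String, Long> large = new LinkedHashMap<>();
        large.put("ask", 10000L);
        large.put("ost", 4000L);    // real total 20000, so 6000 counts were cut off

        System.out.println("truncated total, small: " + withTruncatedTotal(small).frequency("ask"));
        System.out.println("truncated total, large: " + withTruncatedTotal(large).frequency("ask"));
        System.out.println("stored total, small:    " + withStoredTotal(small, 200L).frequency("ask"));
        System.out.println("stored total, large:    " + withStoredTotal(large, 20000L).frequency("ask"));
        System.out.println("scaled to top, small:   " + scaleToTop(small, 100000L));
    }
}
```

With the truncated total, the profile that lost more of its tail reports a 
higher frequency for the same ngram; with the stored total, both profiles agree. 
Whether rescaling to a fixed top entry removes the bias in practice depends on 
how distance() uses the counts, which this sketch does not model.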



