[
https://issues.apache.org/jira/browse/TIKA-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl closed TIKA-496.
----------------------------
Resolution: Won't Do
Closing, as this only affects the original Tika language identifier, which
has been superseded by better options.
> Language identifier profile comparison favors large profiles
> ------------------------------------------------------------
>
> Key: TIKA-496
> URL: https://issues.apache.org/jira/browse/TIKA-496
> Project: Tika
> Issue Type: Bug
> Components: languageidentifier
> Affects Versions: 0.7
> Reporter: Jan Høydahl
> Priority: Major
>
> I think I've found a flaw in the distance algorithm.
> In LanguageProfile.java distance() method, we normalize the frequency for an
> ngram by dividing by the total count.
> The total count for a profile is simply the sum of all counts in the profile.
> The problem is that the .ngp files are cut off at 1000 entries, and the total
> count is then the sum of only those 1000 entries.
> However, there is a long tail of lower-frequency ngrams which are cut
> off and therefore not included in the total count.
> The effect is that ngrams from profiles with a large training set carry more
> weight than ngrams from profiles with a smaller training set.
> You can see this effect especially well when classifying short texts in a
> language which has similar sister languages with larger training sets. My
> example is "no" vs "da".
> Sample from the tail of "no.ngp":
> _gå 461
> ask 461
> ria 459
> små 459
> ...and from the tail of "da.ngp":
> dbr 966
> ost 966
> ævn 964
> It is obvious that "da" has a longer tail after cutoff than "no" and
> therefore a larger sum.
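
To make the normalization issue described above concrete, here is a minimal Java sketch. It is not Tika's actual LanguageProfile code: the class name, the method names and the toy counts are all invented for illustration. It only shows that dividing by the sum of the retained entries gives a different relative frequency than dividing by the total that includes the discarded tail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of Tika.
class TruncatedTotalDemo {

    // Relative frequency of one ngram for a given normalizing total.
    public static double relativeFrequency(long count, long total) {
        return (double) count / total;
    }

    // Sum of the counts that survived the cutoff, i.e. what a loader
    // would compute if it only sees the truncated profile.
    public static long sumOfCounts(Map<String, Long> retained) {
        long sum = 0;
        for (long c : retained.values()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy profile: invented ngrams and counts, not real .ngp data.
        Map<String, Long> retained = new LinkedHashMap<>();
        retained.put("aaa", 600L);
        retained.put("bbb", 400L);

        long truncatedTotal = sumOfCounts(retained);
        long discardedTail = 1000L; // assumed mass lost to the cutoff
        long trueTotal = truncatedTotal + discardedTail;

        // The same ngram gets a higher frequency against the truncated total.
        System.out.println(relativeFrequency(600L, truncatedTotal));
        System.out.println(relativeFrequency(600L, trueTotal));
    }
}
```

The size of the distortion depends on how much of the real total the cutoff discards, which can differ from one profile to another.
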
> A solution is to compute the real total count when generating the .ngp file
> and store that total in the profile file itself, instead of summing the
> counts when loading the cut-off profile.
> Alternatively, normalize the counts before writing the .ngp file, so that the
> top entry is always 100000.
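
The two proposals above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Tika profile writer: the class NgpWriterSketch, its methods and the "# total" header line are hypothetical, and the real .ngp format may need a different way of carrying the total.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Tika.
class NgpWriterSketch {

    // Target value for the top entry in the second proposal.
    public static final long TOP_SCALE = 100_000L;

    // Proposal 1: total over ALL ngram counts, computed before the cutoff.
    public static long realTotal(Map<String, Long> allCounts) {
        long sum = 0;
        for (long c : allCounts.values()) {
            sum += c;
        }
        return sum;
    }

    // Keep only the `limit` most frequent entries, highest count first.
    public static List<Map.Entry<String, Long>> topEntries(
            Map<String, Long> allCounts, int limit) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(allCounts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    // Render a truncated profile that still carries the real total.
    // The "# total N" header line is an assumed format.
    public static String render(Map<String, Long> allCounts, int limit) {
        StringBuilder sb = new StringBuilder();
        sb.append("# total ").append(realTotal(allCounts)).append('\n');
        for (Map.Entry<String, Long> e : topEntries(allCounts, limit)) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Proposal 2: rescale a count so that the top entry becomes TOP_SCALE.
    public static long scale(long count, long topCount) {
        return Math.round((double) count * TOP_SCALE / topCount);
    }
}
```

With the first approach, a loader would read the stored total instead of summing the retained entries. With the second, every profile shares the same scale for its top entry regardless of training set size.
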