Hi,

Sadly, my math is weak, but I will give it a try. Just make sure to
re-check :)

On Thu, Aug 06, 2015 at 11:29:05AM +0200, Daniel Naber wrote:
> we're using a bit probability theory to calculate ngram probabilities.
> This way we can decide which word of a homophone pair like there/their
> is (probably) correct. Is anybody here familiar with probability theory
> and could review that code? The main part is here:
>
> https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java#L41

(I updated the link since this mail is late...)

Below is the relevant function in its full form.

> +  Probability getPseudoProbability(List<String> context) {
> +    int maxCoverage = 0;
> +    int coverage = 0;
> +    long firstWordCount = lm.getCount(context.get(0));
> +    maxCoverage++;

Off topic: This variable could be initialized to 1 directly on the first
line of the function.

> +    if (firstWordCount > 0) {
> +      coverage++;
> +    }
> +    // chain rule:

The chain rule is

  P(A,B,C,...) = P(A) * P(B|A) * P(C|A,B) * ...

So the line below would be P(A)

> +    double p = (double) (firstWordCount + 1) / (totalTokenCount + 1);

which looks okay, but (assuming you are going for Laplace add-one
smoothing) you would have to add not 1 to "totalTokenCount" but the
vocabulary size for the n-gram model (i.e. the number of unique n-grams,
which for unigrams means the number of unique "syntactic words").
Another smoothing approach *may* work better.

> +    debug(" P for %s: %.20f (%d)\n", context.get(0), p, firstWordCount);
> +    for (int i = 2; i <= context.size(); i++) {
> +      List<String> subList = context.subList(0, i);
> +      long phraseCount = lm.getCount(subList);
> +      double thisP = (double) (phraseCount + 1) / (firstWordCount + 1);

This is the place where the conditional probabilities within the chain
are calculated. A conditional probability can be calculated as follows:

  P(B|A) = P(A,B) / P(A)

Using the conditional probability of the token "is" given the token
"this" as an example, it would look like this:

  P("is"|"this") = P("this","is") / P("this")

where

  P("this","is") = C("this is") / C(all 2-grams)

(C() denotes the count of its argument), so I would have expected
something like

+ double thisP = (double) ((Ngramcount + 1) / (countofallNgrams + countofalluniqueNgrams)) / (countofN-1grams + countofalluniqueN-1grams);

Please note that the n-gram-dependent counts would have to be adjusted
for each n as n gets larger.

References:
* http://web.mit.edu/6.863/www/fall2012/readings/ngrampages.pdf
* "Foundations of Statistical Natural Language Processing" by
  Christopher D. Manning and Hinrich Schütze: 42f, 197, 202
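
P.S. To make the chain rule with add-one smoothing more concrete, here is
a minimal, self-contained sketch. Everything in it is an assumption for
illustration: the class AddOneChainRule, its fields and its method names
are hypothetical and not part of LanguageTool, and the counts live in a
plain in-memory map instead of coming from lm.getCount(). It uses the
usual textbook add-one form P(w_i | history) = (C(history, w_i) + 1) /
(C(history) + V), with V the number of unique unigrams. That is not
literally the expression I wrote above, so please re-check it as well.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of chain-rule scoring with add-one smoothing. */
class AddOneChainRule {

  private final Map<List<String>, Long> counts; // n-gram -> occurrence count
  private final long totalTokens;               // N: number of tokens in the corpus
  private final long vocabularySize;            // V: number of unique unigrams

  AddOneChainRule(Map<List<String>, Long> counts, long totalTokens, long vocabularySize) {
    this.counts = counts;
    this.totalTokens = totalTokens;
    this.vocabularySize = vocabularySize;
  }

  /** Raw count of an n-gram; unseen n-grams count as 0. */
  long count(List<String> ngram) {
    return counts.getOrDefault(ngram, 0L);
  }

  /** P(w1) = (C(w1) + 1) / (N + V). */
  double unigramProbability(String word) {
    return (count(Arrays.asList(word)) + 1.0) / (totalTokens + vocabularySize);
  }

  /** P(wi | w1..wi-1) = (C(w1..wi) + 1) / (C(w1..wi-1) + V); needs at least two tokens. */
  double conditionalProbability(List<String> ngram) {
    if (ngram.size() < 2) {
      throw new IllegalArgumentException("need a history and a word");
    }
    List<String> history = ngram.subList(0, ngram.size() - 1);
    return (count(ngram) + 1.0) / (count(history) + vocabularySize);
  }

  /** Chain rule: P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1). */
  double probability(List<String> context) {
    if (context.isEmpty()) {
      throw new IllegalArgumentException("context must not be empty");
    }
    double p = unigramProbability(context.get(0));
    for (int i = 2; i <= context.size(); i++) {
      // The denominator uses the count of the preceding (i-1)-gram,
      // not the count of the first word alone.
      p *= conditionalProbability(context.subList(0, i));
    }
    return p;
  }

  public static void main(String[] args) {
    Map<List<String>, Long> counts = new HashMap<>();
    counts.put(Arrays.asList("this"), 3L);        // made-up toy counts
    counts.put(Arrays.asList("this", "is"), 2L);
    AddOneChainRule model = new AddOneChainRule(counts, 10, 5);
    System.out.println(model.probability(Arrays.asList("this", "is")));
  }
}
```

With these made-up counts, P("this") is (3 + 1) / (10 + 5) and
P("is"|"this") is (2 + 1) / (3 + 5). The main difference from the quoted
patch is that each step divides by the count of the preceding
(i-1)-gram plus V, instead of by firstWordCount + 1 at every step.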