Hi

Sadly, my math is weak but I will give it a try. Just make sure to
re-check :)

On Thu, Aug 06, 2015 at 11:29:05AM +0200, Daniel Naber wrote:
> we're using a bit of probability theory to calculate ngram probabilities. 
> This way we can decide which word of a homophone pair like there/their 
> is (probably) correct. Is anybody here familiar with probability theory 
> and could review that code? The main part is here:
> 
> https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java#L41

(I updated the link since this mail is late...)

Below is the relevant function in its full form.

> + Probability getPseudoProbability(List<String> context) {
> +     int maxCoverage = 0;
> +     int coverage = 0;
> +     long firstWordCount = lm.getCount(context.get(0));
> +     maxCoverage++;

Off topic: this variable could be initialized to 1 directly on the first
line of the function.


> +     if (firstWordCount > 0) {
> +       coverage++;
> +     }
> +     // chain rule:

The chain rule is

P(A,B,C,...) = P(A) * P(B|A) * P(C|A, B) * ...

So the line below would be P(A)

> +     double p = (double) (firstWordCount + 1) / (totalTokenCount + 1);

which looks okay, but (assuming you are going for Laplace add-one
smoothing) you would have to add not 1 but the vocabulary size to
"totalTokenCount", i.e. the number of unique n-grams, which for
unigrams means all unique "syntactic words".

Another smoothing approach *may* work better.
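To make the add-one idea concrete, here is a minimal sketch of a
Laplace-smoothed unigram probability. All counts and names are made up
for illustration; a real model would take them from the n-gram data.

```java
// Minimal sketch of add-one (Laplace) smoothing for a unigram P(A).
// All counts here are hypothetical; "vocabularySize" is the number of
// unique unigrams, not the total token count.
public class LaplaceUnigram {
    static double smoothedUnigramP(long wordCount, long totalTokenCount, long vocabularySize) {
        // Add 1 to the count; add the vocabulary size (not 1) to the
        // denominator so the probabilities of all words still sum to 1.
        return (double) (wordCount + 1) / (totalTokenCount + vocabularySize);
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 1000 tokens, 200 unique words, "this" seen 50 times.
        double p = smoothedUnigramP(50, 1000, 200); // (50 + 1) / (1000 + 200)
        System.out.println(p);
    }
}
```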


> +     debug("    P for %s: %.20f (%d)\n", context.get(0), p, firstWordCount);
> +     for (int i = 2; i <= context.size(); i++) {
> +       List<String> subList = context.subList(0, i);
> +       long phraseCount = lm.getCount(subList);
> +       double thisP = (double) (phraseCount + 1) / (firstWordCount + 1);

This is where the conditional probabilities within the chain are
calculated. A conditional probability can be calculated as follows.

P(B|A) = P(A,B)/P(A)

Using the conditional probability of token "is" given token "this"
as an example it would look like this.

P("is"|"this") = P("this","is")/P("this")

where

P("this","is") = C("this is")/C(all 2-grams)

( C() denotes the count of the argument )

so I would have expected something like

+ double thisP = ((double) (ngramCount + 1) / (countOfAllNgrams + countOfAllUniqueNgrams))
+              / ((double) (nMinus1GramCount + 1) / (countOfAllNminus1Grams + countOfAllUniqueNminus1Grams));

Please note that the n-gram-dependent counts (the totals and the
unique counts) have to be taken over n-grams of the same order, so
they need to be adjusted as n gets larger.
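Putting the pieces together, here is a sketch of P(B|A) computed as the
ratio of the smoothed joint probability to the smoothed unigram
probability, following the formula above. All counts and helper names
are hypothetical.

```java
// Sketch of the smoothed conditional probability P(B|A) = P(A,B) / P(A).
// Counts are invented for illustration; "total" and "unique" are always
// taken over n-grams of the same order as the count being smoothed.
public class SmoothedConditional {
    // Add-one smoothed probability of an n-gram:
    // (count + 1) / (total n-grams + unique n-grams), both of order n.
    static double smoothedP(long ngramCount, long totalNgrams, long uniqueNgrams) {
        return (double) (ngramCount + 1) / (totalNgrams + uniqueNgrams);
    }

    public static void main(String[] args) {
        // Hypothetical counts: C("this") over all unigrams,
        // C("this is") over all bigrams.
        double pThis   = smoothedP(50, 1000, 200);  // P("this")
        double pThisIs = smoothedP(10,  999, 600);  // P("this","is")
        double pIsGivenThis = pThisIs / pThis;      // P("is"|"this")
        System.out.println(pIsGivenThis);
    }
}
```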

References:
* http://web.mit.edu/6.863/www/fall2012/readings/ngrampages.pdf
* Christopher D. Manning, Hinrich Schütze: "Foundations of Statistical
  Natural Language Processing", pp. 42f, 197, 202

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
