On 21.12.15 at 11:53, Christian Gollwitzer wrote:
So for the spaces, either use proper training material (some long
corpus from Wikipedia or the like) with the punctuation removed, so that
it catches the correct probabilities at word boundaries, or preprocess
by removing the spaces.
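
A minimal sketch of both preprocessing options, assuming Python 3 and plain ASCII punctuation. The function names strip_punctuation and strip_spaces are made up for illustration and are not from the script discussed earlier:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lowercase, drop punctuation, collapse whitespace runs to one space."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())

def strip_spaces(text):
    """Remove all whitespace, so no pair ever spans a word boundary."""
    return "".join(text.split())
```

Whichever option you pick, apply the same function to the training corpus and to the strings you score, otherwise the pair statistics will not match.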

     Christian

PS: The real log-likelihood would become -infinity when some pair does not appear at all in the training set (especially the numbers, for example). I used 1/total as the default value in the defaultdict to mitigate that. You could tweak that value a bit. The larger the corpus, the more sharply the scores will separate on their own, too.
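
To make the 1/total fallback concrete, here is a rough reconstruction of the idea, not the exact script. The names train and score are placeholders, and it assumes total means the number of character pairs counted in the training text:

```python
from collections import Counter, defaultdict
from math import log

def train(corpus):
    """Estimate character-pair probabilities from a training string."""
    pairs = Counter(zip(corpus, corpus[1:]))
    total = sum(pairs.values())
    floor = 1.0 / total  # assumed fallback for pairs never seen in training
    probs = defaultdict(lambda: floor)
    for pair, count in pairs.items():
        probs[pair] = count / total
    return probs

def score(text, probs):
    """Sum of log pair probabilities; unseen pairs get the floor, not log(0)."""
    return sum(log(probs[pair]) for pair in zip(text, text[1:]))
```

Note that looking up an unseen pair stores the floor value in the defaultdict, which is harmless here. Replacing 1.0 / total with a smaller constant penalizes unseen pairs more heavily.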

        Christian
--
https://mail.python.org/mailman/listinfo/python-list
