Hi Kenneth,

Thanks for the formula! Now it's returning the usual perplexity values =)
Regards,
Liling

Message: 2
Date: Mon, 8 May 2017 10:15:52 +0100
From: Kenneth Heafield <[email protected]>
Subject: Re: [Moses-support] Computing Perplexity with KenLM (Python API)
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252; format=flowed

Hi Liling,

You can test that your program matches bin/query. None of these is
correct. You want

    math.pow(10.0, sum_inv_logs / n)

Kenneth

---------------------------

On Mon, May 8, 2017 at 2:37 PM, liling tan <[email protected]> wrote:
> Dear Moses Community,
>
> Does anyone know how to compute sentence perplexity with a KenLM model?
>
> Let's say we build a model on this:
>
> $ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
> $ lmplz -o 5 < something.txt > something.arpa
>
> From the perplexity formula
> (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf),
> applying the sum of the inverse logs to get the inner term and then
> taking the nth root, the perplexity numbers come out unusually small:
>
> >>> import kenlm
> >>> m = kenlm.Model('something.arpa')
> # Sentence seen in data.
> >>> s = 'The development of a forward-looking and comprehensive European migration policy,'
> >>> list(m.full_scores(s))
> [(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False),
>  (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False),
>  (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False),
>  (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False),
>  (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False),
>  (-1.7523181438446045, 1, False)]
> >>> n = len(s.split())
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.2536033936438895
>
> Trying again with a sentence not found in the data:
>
> # Sentence not seen in data.
> >>> s = 'The European developement of a forward-looking and comphrensive society is doh.'
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> sum_inv_logs
> 35.59524390101433
> >>> n = len(s.split())
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.383679905428275
>
> And trying again with totally out-of-domain data:
>
> >>> s = """On the evening of 5 May 2017, just before the French Presidential
> ... Election on 7 May, it was reported that nine gigabytes of Macron's
> ... campaign emails had been anonymously posted to Pastebin, a
> ... document-sharing site. In a statement on the same evening, Macron's
> ... political movement, En Marche!, said: "The En Marche! Movement has been
> ... the victim of a massive and co-ordinated hack this evening which has
> ... given rise to the diffusion on social media of various internal
> ... information"""
> >>> sum_inv_logs = -1 * sum(score for score, _, _ in m.full_scores(s))
> >>> sum_inv_logs
> 282.61719834804535
> >>> n = len(list(m.full_scores(s)))
> >>> n
> 79
> >>> math.pow(sum_inv_logs, 1.0/n)
> 1.0740582373271952
>
> Although the longer sentence is expected to have lower perplexity, it is
> strange that all three results differ by less than 1.0 and only in the
> decimals.
>
> Is the above the right way to compute perplexity with KenLM? If not, does
> anyone know how to compute perplexity with KenLM through the Python API?
>
> Thanks in advance for the help!
>
> Regards,
> Liling
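Putting Kenneth's correction together with the calls already used above, a minimal sketch of the computation might look like the following. It assumes the kenlm Python bindings shown in the thread, where full_scores() yields one (log10 probability, n-gram length, OOV flag) tuple per scored token including the end-of-sentence token; taking n from that list rather than from s.split() is one reasonable convention, not the only one.

    import math
    import kenlm

    m = kenlm.Model('something.arpa')

    def perplexity(model, sentence):
        # full_scores() yields one (log10 prob, ngram length, oov) tuple
        # per scored token, including the end-of-sentence token.
        scores = list(model.full_scores(sentence))
        sum_inv_logs = -sum(log10_prob for log10_prob, _, _ in scores)
        n = len(scores)
        # Kenneth's formula: 10 raised to the average negative log10
        # probability, not the nth root of the summed logs.
        return math.pow(10.0, sum_inv_logs / n)

    s = 'The development of a forward-looking and comprehensive European migration policy,'
    print(perplexity(m, s))

The earlier attempts computed math.pow(sum_inv_logs, 1.0/n), i.e. the nth root of the summed negative logs themselves, which is why all three sentences landed within a fraction of 1.0 of each other; exponentiating 10 by the average negative log10 probability restores the usual perplexity scale.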
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
