Hi Rob,

I don't really remember there being any problems with the implementation. Ted 
might remember something that I am not remembering, though.

At the time, we had the idea of using the log likelihood to identify terms and/or 
collocations in documents. The approach worked okay for term identification, but 
what it really ended up being good at was determining the syntactic structure of 
a term. Here is the paper that came out of that work 
if you are interested:

http://www-users.cs.umn.edu/~bthomson/publications/bionlp-acl07.pdf

I have two modules that I wrote for NSP that do statistical analysis using 
the Log Likelihood measure for 4- and 5-grams, and you are welcome to them.

They are:

http://www-users.cs.umn.edu/~bthomson/tools/packages/measure.4gram.tar.gz
http://www-users.cs.umn.edu/~bthomson/tools/packages/measure.5gram.tar.gz

I have not used them in a while, so they have not been officially tested with 
the latest version of NSP. If you have any problems with them, though, please 
let me know and I will see what I can do.

If you create any others please let us know - I think I can say we would all be 
very interested :)

Thanks,

Bridget McInnes
[EMAIL PROTECTED]




On Wed, 19 Nov 2008, rob_koeling wrote:

> 
> Hello,
> 
> I've been playing around with the NSP package for the last couple of
> days, and I must say, I'm well impressed!
> 
> One of the things I would like to do is statistical analysis of 4- and
> 5-grams. I found this message from a few years ago, and before taking
> the plunge myself, I thought I would ask if anything had come of this
> work? If not, were there particular problems with the implementation
> that I should be aware of?
> 
> Best,
>
>   - Rob
> 
> 
> 
> --- In ngram@yahoogroups.com, bridget thomson <[EMAIL PROTECTED]> wrote:
>> 
>> Hello,
>> 
>> This is an email in response to Ted's and Jon's messages.
>> 
>> I have been working on extending the log likelihood measure in NSP
>> to allow for 4-grams and 5-grams. As Ted said, the log likelihood
>> ratio compares the observed and expected values for each cell in
>> your contingency table. Therefore, I have been implementing a way
>> for the user to determine which model the log likelihood
>> calculations should be based on.
>> 
>> For trigrams, there exist only four possibilities:
>> 
>> 1) p(word1, word2, word3) = p(word1) * p(word2) * p(word3)
>> 
>> 2) p(word1, word2, word3) = p(word1, word2) * p(word3)
>> 
>> 3) p(word1, word2, word3) = p(word1) * p(word2, word3)
>> 
>> 4) p(word1, word2, word3) = p(word1, word3) * p(word2)
>> 
>> Remember that these probabilities are the probability of the token(s)
>> occurring in their respective positions. So Model 4 would be the
>> probability that word1 occurs in the first position and word3 occurs
>> in the third position, multiplied by the probability of word2
>> occurring in the second position.
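>>
>> To make that concrete, here is a rough Python sketch (purely
>> illustrative, not code from NSP) of the four decompositions, where
>> each p_* argument is the probability of the given token(s) occurring
>> in their respective positions:
>>
>>     # Illustrative only: the four ways a trigram probability can be
>>     # factored under Models 1-4 above.  p1, p2, p3 are single-position
>>     # probabilities; p12, p23, p13 are joint probabilities over two
>>     # positions.
>>     def trigram_model_probability(model, p1, p2, p3, p12, p23, p13):
>>         if model == 1:   # p(word1) * p(word2) * p(word3)
>>             return p1 * p2 * p3
>>         if model == 2:   # p(word1, word2) * p(word3)
>>             return p12 * p3
>>         if model == 3:   # p(word1) * p(word2, word3)
>>             return p1 * p23
>>         if model == 4:   # p(word1, word3) * p(word2)
>>             return p13 * p2
>>         raise ValueError("model must be 1, 2, 3 or 4")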
>> 
>> This idea is similar for 4-grams and 5-grams, only the number of
>> combinations significantly increases. There exist 14 different
>> possible models for 4-grams and 52 possible models for 5-grams.
>> 
>> The score that log-likelihood produces reflects the degree to which
>> the observed and expected values diverge; the higher that score, the
>> less the observed ngram looks like it is formed from words that are
>> independent of each other.
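>>
>> For reference, the score itself is the standard log-likelihood ratio
>> computed from the observed and expected cell counts; a minimal Python
>> sketch (the function name is just for illustration) is:
>>
>>     import math
>>
>>     # Log-likelihood ratio: 2 * sum over cells of obs * ln(obs / exp).
>>     # Cells with an observed count of zero contribute nothing.
>>     def log_likelihood(observed, expected):
>>         return 2.0 * sum(o * math.log(o / e)
>>                          for o, e in zip(observed, expected)
>>                          if o > 0)
>>
>> where observed and expected are parallel lists of cell counts.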
>> 
>> The calculation of the expected values for these models has
>> an interesting characteristic. A case known as a priori zeros
>> arises when calculating the expected values based on a dependence
>> between tokens in the ngram. This happens because the marginal
>> totals of the observed values are fixed by the size of the sample.
>> Therefore the expected value marginal totals are required to be
>> equal to the marginal totals of the observed values. Hence,
>> estimating the expected values when the model is not based on
>> independence is slightly different.
>> 
>> The expected values for cells in which one of the dependent tokens
>> occurs while the other one does not would be equal to zero. For
>> example, if we look at the trigram "New York Times", the expected
>> number of times that "New" occurs in the first position but "York"
>> does not occur in the second position would be zero based on
>> Model 2. This is because the basis for Model 2 is that the first and
>> second tokens are dependent on each other and independent of the
>> third token. Cells that do not carry this constraint are estimated
>> using the following equation:
>>
>>       m_ijk = ( n_ijp * n_ppk ) / n_ppp
>> 
>> I hope this nomenclature is understandable; the 'p' subscript just
>> means that the index is summed over, so n_ppp is the total number of
>> trigrams. For our "New York Times" example, the expected value of the
>> joint frequency would be the number of times that "New" occurs in the
>> first position and "York" occurs in the second, multiplied by the
>> number of times "Times" occurs in the third position, divided by the
>> total number of trigrams.
>> 
>> Analysis of this would be similar to the analysis you would do for
>> Model 1 (the independence model), except that you do not include
>> the a priori zero cells in your calculation.
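>>
>> As a rough sketch of what that looks like (illustrative Python, not
>> the actual NSP code), the expected values under Model 2 for the eight
>> cells of a trigram contingency table could be filled in like this,
>> with the a priori zero cells simply left at zero:
>>
>>     # Illustrative sketch: expected values under Model 2,
>>     # p(word1, word2) * p(word3), for a 2x2x2 trigram table.
>>     # n[i][j][k] holds observed counts; index 0 means the token
>>     # occurs in that position, index 1 means it does not.
>>     def expected_model2(n):
>>         total = sum(n[i][j][k]
>>                     for i in (0, 1) for j in (0, 1) for k in (0, 1))
>>         m = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
>>         for i in (0, 1):
>>             for j in (0, 1):
>>                 if i != j:
>>                     # one dependent token occurs without the other:
>>                     # a priori zero cell, left at zero
>>                     continue
>>                 n_ijp = sum(n[i][j][k] for k in (0, 1))
>>                 for k in (0, 1):
>>                     n_ppk = sum(n[a][b][k]
>>                                 for a in (0, 1) for b in (0, 1))
>>                     m[i][j][k] = n_ijp * n_ppk / total
>>         return m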
>> 
>> I have been doing log-linear tests on each of the models to determine
>> whether they are reasonable to use for multi-word unit extraction. I
>> fit a specific model to the frequencies of a specified contingency
>> table using the following equations:
>> 
>> u_1(i) = ( sum_j^column ln(m_ij) ) / c
>>          - ( sum_i^row sum_j^column ln(m_ij) ) / rc
>> 
>> u_2(j) = ( sum_i^row ln(m_ij) ) / r
>>          - ( sum_i^row sum_j^column ln(m_ij) ) / rc
>> 
>> 
>> The results can be read by determining how far away from zero the
>> values are. A positive value indicates a positive association.
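>>
>> In Python terms, the sketch below is roughly what those two equations
>> compute for an r-by-c table m of fitted values (again just for
>> illustration, not code from the package):
>>
>>     import math
>>
>>     # u_1(i) and u_2(j): row and column effects of the fitted
>>     # log-linear model, i.e. row/column means of ln(m_ij) minus
>>     # the grand mean of ln(m_ij) over all r*c cells.
>>     def u_effects(m):
>>         r, c = len(m), len(m[0])
>>         log_m = [[math.log(v) for v in row] for row in m]
>>         grand = sum(sum(row) for row in log_m) / (r * c)
>>         u1 = [sum(log_m[i]) / c - grand for i in range(r)]
>>         u2 = [sum(log_m[i][j] for i in range(r)) / r - grand
>>               for j in range(c)]
>>         return u1, u2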
>> 
>> I did this for the ngram "New York Times" for trigrams and "New
>> York Times magazine" for 4-grams because I was certain that both
>> ngrams are multi-word units and should be negatively associated at
>> the points where m111 is included.
>> 
>> This way I can conduct my experiments (hopefully) with the
>> knowledge that the model that I have picked is 'reasonable'.
>> 
>> I should have them included in the NSP package by August. They are up,
>> tested and running but I have a bit of documentation that I have to
>> finish before they can go out. The statistics package will be able to
>> run exactly as before. So unless you want to use a model outside of
>> the independence model, you should not notice any change. If you
>> would like to use a different model, this can be specified in a file,
>> similar to how the ordering of the marginal values is specified.
>> 
>> For example, with the independence model for trigrams, the expected
>> values would be ordered as such in a file:
>> 
>> 0
>> 1
>> 2
>> 
>> For the "p(word1 word2) * p(word3)" model:
>> 
>> 0 1
>> 2
>> 
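>> Extending the same idea to 4-grams (I am just extrapolating the
>> trigram layout here), a model such as p(word1, word2) * p(word3, word4)
>> would be specified as:
>> 
>> 0 1
>> 2 3
>> 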
>> and so on. Well, I hope this makes sense and if you have any questions
>> or I have not explained something clearly, please let me know.
>> 
>> Thanks!
>> 
>> Bridget
>> 
>> 
>> 
>> On Tue, 6 Jul 2004, theonomo wrote:
>> 
>>> Hello all.
>>> 
>>> I would like to calculate statistical measures of association for 4-
>>> grams and 5-grams.  Is this possible?  Does this even make sense?
>>> 
>>> Thanks,
>>> 
>>> Jon
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
>
