Hi Arezki,

I will use Unix-style notation for files, since that is more
comfortable for me, but I think the basic idea remains the same. For
any file (let's call it input.txt) you can find the Mutual Information
of the bigrams in it with the following two steps:

count.pl --ngram 2 output.txt input.txt

statistic.pl tmi output-mi.txt output.txt
OR
statistic.pl pmi output-mi.txt output.txt

Note that there are two ways of finding Mutual Information: one we
refer to as true mutual information (tmi) and the other as pointwise
mutual information (pmi). For finding collocations, pmi tends to be
what people mean when they talk about mutual information, but we also
provide tmi in case someone means the more classical definition of
mutual information from information theory. The differences between
the two are described in the perldoc for each measure:

http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/2D/MI/pmi.pm
http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/2D/MI/tmi.pm
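
To make the difference between the two measures concrete, here is a
minimal Python sketch based on the standard textbook definitions. It is
not the Text-NSP code itself, and the function names are hypothetical.
The arguments use the usual 2x2 contingency table notation: n11 is the
count of the bigram, n1p the count of bigrams with the first word in
the first position, np1 the count of bigrams with the second word in
the second position, and npp the total number of bigrams.

```python
import math


def expected(row_total, col_total, npp):
    """Expected cell count if the two words were independent."""
    return row_total * col_total / npp


def pmi(n11, n1p, np1, npp):
    """Pointwise MI: log2 of the observed over the expected joint count."""
    return math.log2(n11 / expected(n1p, np1, npp))


def tmi(n11, n1p, np1, npp):
    """True MI: sum over all four cells of the 2x2 contingency table."""
    # Derive the remaining cells and marginals from the four inputs.
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    n2p = npp - n1p
    np2 = npp - np1
    cells = [
        (n11, n1p, np1),
        (n12, n1p, np2),
        (n21, n2p, np1),
        (n22, n2p, np2),
    ]
    total = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            ratio = observed / expected(row_total, col_total, npp)
            total += (observed / npp) * math.log2(ratio)
    return total
```

pmi looks only at the cell for the bigram itself, while tmi sums over
all four cells, so the two measures can rank the same bigrams quite
differently.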

There are also some handy options with count.pl that let you eliminate
stop words and things like that, but the above is the most basic way
to run things, and that's probably a good starting point.

The output from statistic.pl comes sorted by score, highest first. If
you want a separate list for each of your files, just run the two
steps on each file separately. If you want one big list for all of the
files, you can specify as many input files on the command line as you
like, as in:

count.pl --ngram 2 output.txt input1.txt input2.txt input3.txt

Then you could run statistic.pl on output.txt as described above.
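
If you have many files and want a separate list for each, the two steps
can be scripted. The following is only a sketch in Python: it assumes
count.pl and statistic.pl are on your PATH and directly executable (on
Windows you may need to run them through perl instead), and the output
file names it builds are just a hypothetical naming scheme.

```python
import subprocess
from pathlib import Path


def nsp_commands(input_file, measure="pmi"):
    """Build the count.pl and statistic.pl command lines for one file."""
    stem = Path(input_file).stem
    counts = f"{stem}-count.txt"      # hypothetical name for the bigram counts
    scores = f"{stem}-{measure}.txt"  # hypothetical name for the scored list
    return [
        ["count.pl", "--ngram", "2", counts, str(input_file)],
        ["statistic.pl", measure, scores, counts],
    ]


def run_all(files, measure="pmi"):
    """Run both steps for every file, stopping at the first failure."""
    for input_file in files:
        for command in nsp_commands(input_file, measure):
            subprocess.run(command, check=True)
```

For example, run_all(["a.txt", "c.txt"]) would leave the scored bigrams
in a-pmi.txt and c-pmi.txt.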

I hope this all helps. Let us know if further questions arise.

Good luck!
Ted

On Fri, Aug 29, 2008 at 10:15 AM, arezki20002002
<[EMAIL PROTECTED]> wrote:
> Hi Ted,
>
> I have a collection of text documents, "coll.txt",
> which contains:
> D:\c.txt
> D:\e.txt
> D:\d.txt
> D:\a.txt
> D:\f.txt
> D:\g.txt
> D:\h.txt
> D:\j.txt
> How can I apply the MI measure to extract the bigrams in this
> collection and put them in separate files with their scores in
> decreasing order?
>
> Best Regards
> Arezki
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
