I'd really caution against relying on pre-compiled word lists - there
are a lot of subtle choices that go into creating such lists that you
should be aware of before you start using them. For example, do you
want stop words included or excluded? Do you want punctuation included
or excluded? Do you want everything capitalized or left as is?
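
To make those choices concrete, here is a minimal Python sketch of
how they change the counts. It is not how count.pl works internally;
the function name count_ngrams, its parameters, and the sample text
are made up for illustration, and dropping every ngram that contains
a stop word is just one possible stop-list policy.

```python
import re
from collections import Counter


def count_ngrams(text, n=2, stop_words=frozenset(), lowercase=False,
                 keep_punct=False):
    """Count n-grams in text under explicit tokenization choices."""
    if lowercase:
        text = text.lower()
    # Tokens are runs of word characters; optionally every punctuation
    # mark becomes a token of its own.
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    tokens = re.findall(pattern, text)
    ngrams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Illustrative policy: discard any n-gram containing a stop word.
    return Counter(
        gram for gram in ngrams
        if not any(tok.lower() in stop_words for tok in gram)
    )


sample = "The cat sat. The Cat sat on the mat."
as_is = count_ngrams(sample)
folded = count_ngrams(sample, lowercase=True)
no_stop = count_ngrams(sample, lowercase=True, stop_words={"the", "on"})
with_punct = count_ngrams(sample, keep_punct=True)
```

With the text left as is, "The cat" and "The Cat" are different
bigrams; after case folding they become one bigram counted twice, and
with the stop list only "cat sat" survives. The same input gives three
different lists, which is why it matters how a pre-compiled list was
built.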

The good news is that these things really are quite simple to control
in the Ngram Statistics Package. I'd really encourage you to try and
create these lists yourself - it's not that difficult, and it won't
take much time to learn. Then, you will have the advantage of being
able to control exactly what these lists consist of.

The simplest way to create a list of ngrams (sorted by frequency) is
the command below. The files input1.txt and input2.txt are the input,
and the sorted bigrams are stored in bigram-list.txt.

count.pl --ngram 2 bigram-list.txt input1.txt input2.txt

If you want to exclude stop words, you can add a stop list...

count.pl --ngram 2 --stop mystoplist.txt bigram-list.txt input1.txt input2.txt

If you want trigrams or 4-grams, you can just change the --ngram
value to 3 or 4 and then proceed.
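
If you then want the list in a spreadsheet, something like the
following Python sketch could turn the output into CSV, which Excel
opens directly. It assumes the usual count.pl layout: a first line
holding the total number of ngrams, then one line per ngram with the
tokens and counts joined by "<>" and the ngram frequency as the first
number. The helper name ngram_rows and the sample numbers are made up
for illustration, so check the format against your own output file.

```python
import csv
import io


def ngram_rows(lines):
    """Yield (token_1, ..., token_n, frequency) from count.pl-style lines."""
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            continue  # skips the leading total-count line and blank lines
        *tokens, counts = line.split("<>")
        # The first number after the last "<>" is assumed to be the
        # frequency of the whole n-gram.
        yield (*tokens, int(counts.split()[0]))


# Made-up sample in the assumed format, not real corpus counts.
sample_output = """11
of<>the<>3 5 4
in<>the<>2 2 4
"""

buffer = io.StringIO()
csv.writer(buffer).writerows(ngram_rows(sample_output.splitlines()))
csv_text = buffer.getvalue()
```

To work on a real file, pass an open handle on bigram-list.txt to
ngram_rows and write the rows to a file opened with newline="".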

Most of this is explained fairly completely here:

The Design, Implementation, and Use of the Ngram Statistics Package
(Banerjee and Pedersen) - Appears in the Proceedings of the Fourth
International Conference on Intelligent Text Processing and
Computational Linguistics, pp. 370-381, February 17-21, 2003, Mexico
City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

In any case, I'd really encourage you to at least try to create these
on your own, because in the end you'll be much happier with the
result, and you'll be able to change things much more easily.

That said, there are word lists like this available, and if someone
has created something I'm sure they'd be happy to share. However, it
would be helpful to explain why you need such a list and what you are
going to do with it (so they can judge if it is suitable or not for
your task).

Good luck!
Ted

On Mon, Dec 15, 2008 at 7:34 AM, Abhijit <abhijit8...@yahoo.com> wrote:
>
> Hi everybody,
>
> I don't know how to use this package by myself. Handicap there.
>
> However I requested Ted and am here because I need to use the
> data that this package generates.
>
> I need to get hold of a somewhat usable sample database of phrase
> lists - most frequently occurring bigrams, trigrams and tetragrams
> that occur in the English language of common usage, not classical
> stuff.
>
> Could anyone share with me any such word list that you may have
> compiled using this package.
>
> Just need the above ngrams for contemporary English of popular usage.
>
> It doesn't have to be clean. I can clean it and give it back to you.
>
> Excel file would be great.
>
> Thank you in advance.
>
> Abhijit, India
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
