A user recently asked what the most frequent words are at the start of
a sentence in English. There is a relatively simple way to count up
such words for a given corpus using the Ngram Statistics Package.

First, start with a corpus where you know that the first word on a
line is the first word of a sentence. In my case I have used a version
of the GigaWords corpus that I have reformatted so that it has one
article per line. Note that this means we aren't using all the
sentences in GigaWords, just the first sentence in each article. That
still gives us more than 4,000,000 first words, since that's
approximately the number of articles.

The crucial step is this: I created a token file (called token.txt)
that contains just one line:

/^\w+/

This says that a token can only be a string of word characters (\w+,
that is letters, digits, and underscore) at the start of a line (^).
Everything else in your corpus is ignored, so this actually goes
pretty fast even for very large corpora, and will only count the
first word on each line.
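
To make the effect of that pattern concrete, here is a minimal Python
sketch of the same idea. It is not how count.pl is implemented; the
function name count_first_words and the sample lines are made up for
illustration, and Python's \w may differ slightly from Perl's on
non-ASCII text.

```python
import re
from collections import Counter

# Same pattern as in token.txt: a run of word characters at the start of a line.
FIRST_WORD = re.compile(r"^\w+")

def count_first_words(lines):
    """Count the line-initial word of each line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FIRST_WORD.match(line)
        if match:  # lines that do not start with a word character add nothing
            counts[match.group(0)] += 1
    return counts

# Made-up sample: one "article" per line, several sentences per article.
sample = [
    "THE MARKET ROSE ON MONDAY. IT FELL ON TUESDAY.",
    "A NEW REPORT WAS RELEASED. THE FINDINGS WERE MIXED.",
    "THE PRESIDENT SPOKE TODAY.",
    '"QUOTED" OPENINGS DO NOT START WITH A WORD CHARACTER.',
]
counts = count_first_words(sample)
for word, freq in counts.most_common():
    print(f"{word}<>{freq}")
```

In this sketch only the first word of each line is counted, and a line
that opens with a quotation mark contributes no token at all.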

If my GigaWords files are in a directory, I can specify that directory
name on the command line and use the --recurse option to take all the
files in that directory (and any subdirectories) as input:

count.pl --recurse --ngram 1 --token token.txt output gigawords-directory
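
For anyone without NSP at hand, the recursive part of that command can
be approximated in Python. This is only a sketch under my own
assumptions: count_first_words_in_tree is a hypothetical helper, the
files are assumed to be plain UTF-8 text, and the demo directory is a
throwaway one created on the spot.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

FIRST_WORD = re.compile(r"^\w+")

def count_first_words_in_tree(root):
    """Count line-initial words in every file under root (hypothetical helper)."""
    counts = Counter()
    for path in sorted(Path(root).rglob("*")):  # descends into subdirectories
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                match = FIRST_WORD.match(line)
                if match:
                    counts[match.group(0)] += 1
    return counts

# Demo on a small temporary directory tree with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text(
        "THE FIRST ARTICLE.\nA SECOND ARTICLE.\n", encoding="utf-8"
    )
    (root / "sub" / "b.txt").write_text("THE THIRD ARTICLE.\n", encoding="utf-8")
    tree_counts = count_first_words_in_tree(root)

print(sum(tree_counts.values()))  # total number of first words counted
for word, freq in tree_counts.most_common():
    print(f"{word}<>{freq}")
```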

The output file now contains a list, sorted by frequency, of all the
words that occur as the first word of an article in GigaWords. It
turns out that more than 60,000 types are used to start the 4,000,000
articles. Here are the first few lines of that file (the number on the
first line is the total count of first words).

This is perhaps a reasonable proxy for the first word in sentences in
English, although you will note some words that are clearly very
frequent due to the domain of GigaWords (which is news).

 4104905
THE<>605091
A<>236603
IN<>66781
FOLLOWING<>53228
HERE<>42938
FOR<>41349
AN<>41195
PRESIDENT<>39807
IT<>37652
U<>37518
CHINA<>34403
XFDWS<>28762
TWO<>28413
WITH<>28382
NEW<>27593
AS<>25196
WHEN<>24951
AT<>24083
AFTER<>21167
SOUTH<>20220
ATTENTION<>18403
CHINESE<>17965
JAPAN<>17613
FORMER<>16631
BY<>16270
ISRAELI<>16127
IF<>15839
RUSSIAN<>15757
POLICE<>15682
THERE<>14845
ONE<>13927
JAPANESE<>13570
THREE<>13556
ADDS<>13515
THESE<>13123
THIS<>12964
ISRAEL<>12913
MORE<>12880
ON<>12843
SOME<>12270
RUSSIA<>12145
HONG<>11960
US<>11678
PRIME<>11372
WORLD<>11317
FRENCH<>11114
I<>10724
LONDON<>10509
BRITISH<>10495
SHARE<>10362
AMERICAN<>10078
EDS<>8986
BRITAIN<>8744
HE<>8638
THEY<>8410
FOUR<>8338
FRANCE<>8267
NO<>8256
INDIA<>8031
GERMAN<>7751
C<>7670
ATTN<>7658
ABOUT<>7419
NATO<>7381
EUROPEAN<>7234
BUSH<>7129
JANUARY<>7026
PALESTINIAN<>6991
DESPITE<>6940
WHILE<>6854
NATIONAL<>6741
PARIS<>6682
DECEMBER<>6639
TO<>6579
RESULTS<>6556
THOUSANDS<>6542
TOKYO<>6445
AUSTRALIAN<>6433
STOCKS<>6363
IRAN<>6359
FIVE<>6304
GERMANY<>6271
IRAQ<>6124
WHAT<>6076
FROM<>6056
AUSTRALIA<>5933
Q<>5923
PAKISTAN<>5919
GOLD<>5761
PHILIPPINE<>5609
HUNDREDS<>5590
ART<>5463
SRI<>5431
EDITORS<>5350
SINGAPORE<>5338
YOU<>5331
WEATHER<>5325
TODAY<>5305
TOP<>5283
ALL<>5212
FOREIGN<>5154
WE<>5124
BUT<>5104
MALAYSIA<>5062
EVEN<>5020
DOW<>4974
WASHINGTON<>4938
ITALIAN<>4875
JUST<>4854
JOHN<>4852
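
If you want relative frequencies rather than raw counts, the output is
easy to post-process. The sketch below assumes the layout shown above
(a total on the first line, then word<>count lines); the function name
parse_unigram_output is hypothetical, and the excerpt is just the top
of the list above.

```python
def parse_unigram_output(text):
    """Parse 'total' plus 'WORD<>COUNT' lines (assumed layout, hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    total = int(lines[0])
    counts = {}
    for line in lines[1:]:
        word, _, freq = line.partition("<>")
        counts[word] = int(freq)
    return total, counts

# Top of the list from the output above.
excerpt = """\
4104905
THE<>605091
A<>236603
IN<>66781
"""
total, counts = parse_unigram_output(excerpt)
for word, freq in counts.items():
    print(f"{word}\t{freq}\t{100 * freq / total:.2f}%")
```

By this reckoning THE starts roughly 14.7% of the articles.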

The real power here is in the --token option, which can do a lot of
interesting things like this.

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
