A user recently asked what the most frequent words are at the start of a sentence in English. There is a relatively simple way to count up such words for a given corpus using the Ngram Statistics Package.
First, start with a corpus where you know that the first word on a line is the first word of a sentence. In my case I used a version of the GigaWords corpus that I have reformatted so that it has one article per line. Note that this means we aren't using all the sentences in GigaWords, just the first sentence in each article, but that still gives us more than 4,000,000 first words, since that is approximately the number of articles.

The crucial step is this: I created a token file (called token.txt) that contains just one line:

    /^\w+/

This says that a token can only be an alphanumeric string (\w+) at the start of a line (^). Everything else in your corpus is ignored, so this goes pretty fast even for very large corpora, and only the first word on each line is counted.

If my GigaWords files are in a directory, I can specify that directory name on the command line and use the --recurse option to take all the files in that directory (and any subdirectories) as input:

    count.pl --recurse --ngram 1 --token token.txt output gigawords-directory

The output file now contains a sorted list of all the words that occur as the first word of a line (that is, of an article) in GigaWords. It turns out that more than 60,000 types are used to start the 4,000,000 articles.

Here are the first few lines of that output, the most frequent first words in articles in GigaWords. The leading number is the total number of first words counted. This is perhaps a reasonable proxy for the first word in sentences in English, although you will note some words that are clearly very frequent due to the domain of GigaWords (which is news):
4104905
THE<>605091
A<>236603
IN<>66781
FOLLOWING<>53228
HERE<>42938
FOR<>41349
AN<>41195
PRESIDENT<>39807
IT<>37652
U<>37518
CHINA<>34403
XFDWS<>28762
TWO<>28413
WITH<>28382
NEW<>27593
AS<>25196
WHEN<>24951
AT<>24083
AFTER<>21167
SOUTH<>20220
ATTENTION<>18403
CHINESE<>17965
JAPAN<>17613
FORMER<>16631
BY<>16270
ISRAELI<>16127
IF<>15839
RUSSIAN<>15757
POLICE<>15682
THERE<>14845
ONE<>13927
JAPANESE<>13570
THREE<>13556
ADDS<>13515
THESE<>13123
THIS<>12964
ISRAEL<>12913
MORE<>12880
ON<>12843
SOME<>12270
RUSSIA<>12145
HONG<>11960
US<>11678
PRIME<>11372
WORLD<>11317
FRENCH<>11114
I<>10724
LONDON<>10509
BRITISH<>10495
SHARE<>10362
AMERICAN<>10078
EDS<>8986
BRITAIN<>8744
HE<>8638
THEY<>8410
FOUR<>8338
FRANCE<>8267
NO<>8256
INDIA<>8031
GERMAN<>7751
C<>7670
ATTN<>7658
ABOUT<>7419
NATO<>7381
EUROPEAN<>7234
BUSH<>7129
JANUARY<>7026
PALESTINIAN<>6991
DESPITE<>6940
WHILE<>6854
NATIONAL<>6741
PARIS<>6682
DECEMBER<>6639
TO<>6579
RESULTS<>6556
THOUSANDS<>6542
TOKYO<>6445
AUSTRALIAN<>6433
STOCKS<>6363
IRAN<>6359
FIVE<>6304
GERMANY<>6271
IRAQ<>6124
WHAT<>6076
FROM<>6056
AUSTRALIA<>5933
Q<>5923
PAKISTAN<>5919
GOLD<>5761
PHILIPPINE<>5609
HUNDREDS<>5590
ART<>5463
SRI<>5431
EDITORS<>5350
SINGAPORE<>5338
YOU<>5331
WEATHER<>5325
TODAY<>5305
TOP<>5283
ALL<>5212
FOREIGN<>5154
WE<>5124
BUT<>5104
MALAYSIA<>5062
EVEN<>5020
DOW<>4974
WASHINGTON<>4938
ITALIAN<>4875
JUST<>4854
JOHN<>4852

The real power here is in the --token option, which can do a lot of interesting things like this.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
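A footnote for readers who don't have NSP installed: the idea behind the token file can be approximated in a few lines of Python. This is only a rough sketch, not how count.pl is implemented. The function names (count_first_words, iter_corpus_lines, format_unigrams) are made up for illustration, and it assumes UTF-8 text with one article per line. It applies the same pattern, ^\w+, to each line and prints the total followed by word<>count pairs, mimicking the layout of the listing above.

```python
import re
from collections import Counter
from pathlib import Path

# Same pattern as token.txt: one run of word characters at the start of a line.
FIRST_WORD = re.compile(r"^\w+")


def count_first_words(lines):
    """Count the token that opens each line; lines with no match are skipped."""
    counts = Counter()
    for line in lines:
        match = FIRST_WORD.match(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def iter_corpus_lines(root):
    """Yield every line of every regular file under root, recursively
    (a stand-in for what --recurse does; assumes UTF-8 text files)."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            with path.open(encoding="utf-8", errors="replace") as handle:
                yield from handle


def format_unigrams(counts):
    """Total on the first line, then word<>count in descending frequency."""
    rows = [str(sum(counts.values()))]
    rows += [f"{word}<>{n}" for word, n in counts.most_common()]
    return "\n".join(rows)


# Tiny in-memory demo; the last line starts with spaces, so it contributes nothing.
sample = ["THE market rose", "A new report", "THE president said", "  indented line"]
print(format_unigrams(count_first_words(sample)))
```

To run it over a directory of files instead of the sample, pass iter_corpus_lines("gigawords-directory") to count_first_words. For a corpus of this size NSP is still the more practical choice; the sketch is only meant to show what the single-line token file is doing.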