[ilug-cal] [OT] Doug McIlroy's Gem of a Script

dipankar das Wed, 19 Apr 2006 03:24:42 -0700

Dear Friends

Many of you must know this story very well already. Still, the story, and 
even more the sheer gem-like brilliance of the script, made me tell it here 
once again. I found it in "Classic Shell Scripting" by Robbins and Beebe 
(O'Reilly).


Jon Bentley of Bell Laboratories once posed this problem, reformulated by 
David Hanson, "Given a text file and an integer 'n', you are to print the 
words (and their frequency of occurrence in the text file) whose frequencies 
of occurrence are among the 'n' largest in order of decreasing frequency."

Computer scientists Donald Knuth and David Hanson came up with "interesting 
and clever literate programs, each of which took several hours to write". And 
then came McIlroy, or better 'the Patriarch McIlroy', as Eric Raymond called 
him in his Master Foo koans in "The Art of Unix Programming". 

McIlroy "offered a six step Unix solution that took only a couple of minutes 
to develop and worked correctly the first time". Much more could be said 
about how the script embodies the core of the Unix philosophy, but let me 
quote the gem of a script now:
<<<
#!/bin/sh
tr -cs A-Za-z\' '\n'|tr A-Z a-z|sort|uniq -c|sort -k1,1nr -k2|sed ${1:-25}q
>>>
Just save these two lines in a file named 'wf' and run it on any text file 
with 'sh ./wf < file' to see the results. The script reads its text from 
standard input, so the file has to be redirected into it. 
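For anyone who wants to try this right away, here is a minimal sketch. The 
scratch directory and the sample sentence are my own assumptions for 
illustration and are not from the book; only the two lines of the script 
itself are McIlroy's.

```shell
# Sketch: save the script as 'wf' in a scratch directory and try it on a tiny input.
# The directory and the sample sentence are assumptions made for this illustration.
dir=$(mktemp -d)
cat > "$dir/wf" <<'EOF'
#!/bin/sh
tr -cs A-Za-z\' '\n'|tr A-Z a-z|sort|uniq -c|sort -k1,1nr -k2|sed ${1:-25}q
EOF

# The script reads standard input; the optional argument limits the number of lines.
printf 'to be or not to be\n' | sh "$dir/wf" 3
```

This should list 'be' and 'to' with a count of 2 each, followed by 'not' with 
a count of 1. The amount of padding in front of the counts depends on the 
local 'uniq'.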

I ran it on a whole 400 page book of mine and got quite interesting and funny 
results. Let me quote here a few results that Robbins and Beebe got by 
running the script on Shakespeare's "Hamlet". 

They wanted the 12 most frequent words, formatted into a four-column display 
by 'pr', using 'wf 12 < hamlet | pr -c4 -t -w80', with the result:

1148 the            671 of              550 a               451 in
 970 and            635 i               514 my              419 it
 771 to             554 you             454 hamlet          407 that

If the workings of the script are not quite clear to anyone here, let me 
elaborate a bit, the Robbins and Beebe way of course. The first step: 
tr -cs A-Za-z\' '\n'
Here, 'tr' replaces every run of characters other than letters (A to Z and 
a to z) and the apostrophe with a single newline, so that each word lands on 
its own line, and then its output is piped to the second step:
tr A-Z a-z 
Here, 'tr' again simply changes all uppercase from A to Z into lowercase from 
a to z and then this output is piped to the third step:
sort
Here the command 'sort' sorts the whole output alphabetically and pipes it to 
the fourth step:
uniq -c 
Here 'uniq -c' collapses each run of identical lines into a single copy, 
prefixed with the number of times it occurred, and this output is piped to 
the fifth step:
sort -k1,1nr -k2
The command 'sort' now sorts the entries numerically by count in descending 
order, breaking ties by word in ascending order, and this output is piped to 
the sixth and final step:
sed ${1:-25}q
Here, the command 'sed' prints the first n lines and then quits, where n is 
the first argument given to the script (default is 25).
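
The six steps above can also be laid out one per line. This is only a sketch: 
wrapping the pipeline in a shell function called 'wf' and the sample sentence 
are my own assumptions, and the quoting is slightly more defensive than in 
the original one-liner, but the stages are the same.

```shell
#!/bin/sh
# Sketch: the same six stages, one per line, wrapped in a function for easy testing.
wf() {
    tr -cs "A-Za-z'" '\n' |   # 1. each run of non-letters becomes one newline
    tr A-Z a-z |              # 2. fold uppercase to lowercase
    sort |                    # 3. bring identical words next to each other
    uniq -c |                 # 4. one copy of each word, prefixed by its count
    sort -k1,1nr -k2 |        # 5. highest count first, ties broken by word
    sed "${1:-25}q"           # 6. print the first n lines (default 25), then quit
}

# Sample input (an assumption for illustration): 'the' occurs three times.
printf 'The cat and the dog. The end.\n' | wf 2
```

With this input the first line shows 'the' with a count of 3, and the second 
shows 'and' with a count of 1, since 'and' comes first alphabetically among 
the words that occur once.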

For many of you this may be quite household stuff, but I got excited and 
needed a catharsis, and Indra cannot blame me, since I already tagged it 
with [OT].

dipankar das
        

--
To unsubscribe, send mail to [EMAIL PROTECTED] with the body
"unsubscribe ilug-cal" and an empty subject line.
FAQ: http://www.ilug-cal.org/node.php?id=3
