Dear Friends
Many of you probably know this very well already. Still, the story, and even
more the sheer gem-like brilliance of the script, made me want to tell it here
once again. I found it in "Classic Shell Scripting" by Robbins and Beebe
(O'Reilly).
Jon Bentley of Bell Laboratories once posed this problem, reformulated by
David Hanson, "Given a text file and an integer 'n', you are to print the
words (and their frequency of occurrence in the text file) whose frequencies
of occurrence are among the 'n' largest in order of decreasing frequency."
Computer scientists Donald Knuth and David Hanson came up with "interesting and
clever literate programs, each of which took several hours to write". And then
came McIlroy, or better 'the Patriarch McIlroy', as Eric Raymond called him
in his Master Foo Koans in "The Art of Unix Programming".
McIlroy "offered a six-step Unix solution that took only a couple of minutes
to develop and worked correctly the first time". So much more could be said
about how the script relates to the core of the Unix philosophy, but let me
quote the gem of a script now:
<<<
#!/bin/sh
tr -cs A-Za-z\' '\n'|tr A-Z a-z|sort|uniq -c|sort -k1,1nr -k2|sed ${1:-25}q
>>>
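Just save these two lines in a file named 'wf'. The script reads its text from
standard input, so run it with 'sh ./wf < yourfile' on any text file and see
the results. An optional number, as in 'sh ./wf 12 < yourfile', sets how many
words are shown.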
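For anyone who wants to try it without hunting for a text file first, here is a minimal sketch. The temporary directory, the file name sample.txt and its two sentences are my own inventions for illustration; only the two script lines come from the book.

```shell
# Work in a scratch directory so nothing of yours gets overwritten
# (mktemp -d is assumed to be available, as it is on most systems)
dir=$(mktemp -d)

# Save the two-line script as 'wf'; the quoted 'EOF' keeps ${1:-25} literal
cat > "$dir/wf" <<'EOF'
#!/bin/sh
tr -cs A-Za-z\' '\n'|tr A-Z a-z|sort|uniq -c|sort -k1,1nr -k2|sed ${1:-25}q
EOF

# A made-up sample text, purely for illustration
printf 'The cat and the dog.\nThe dog and the bird.\n' > "$dir/sample.txt"

# The script reads standard input; the optional argument is n (here 3)
top3=$(sh "$dir/wf" 3 < "$dir/sample.txt")
printf '%s\n' "$top3"
```

This should list 'the' first with a count of 4, followed by 'and' and 'dog' with 2 each. The width of the padding before the counts depends on your 'uniq'.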
I ran it on a whole 400 page book of mine and got quite interesting and funny
results. Let me quote here a few results that Robbins and Beebe got by
applying the script on Shakespeare's "Hamlet".
They wanted the 12 most frequent words, formatted into a four-column
display by 'pr', using 'wf 12 < hamlet | pr -c4 -t -w80', with the result:
1148 the      671 of     550 a        451 in
 970 and      635 i      514 my       419 it
 771 to       554 you    454 hamlet   407 that
If there is anyone here for whom the workings of the script are not quite
clear, let me elaborate a bit, the Robbins and Beebe way of course. The first
step:
tr -cs A-Za-z\' '\n'
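Here, 'tr' replaces every run of characters other than letters (A to Z and a
to z) and apostrophes with a single newline: '-c' takes the complement of the
given set and '-s' squeezes repeated newlines into one. The result is one word
per line, and this output is piped to the second step: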
tr A-Z a-z
Here, 'tr' simply changes all uppercase letters from A to Z into lowercase
from a to z, and then this output is piped to the third step:
sort
Here the command 'sort' sorts the whole output alphabetically, which brings
identical words next to each other, and pipes it to the fourth step:
uniq -c
Here 'uniq' collapses adjacent duplicate lines, keeping only one copy of each
word, and '-c' prefixes each with its count. This output is piped to the fifth
step:
sort -k1,1nr -k2
The command 'sort' now sorts the entries numerically by count in descending
order ('-k1,1nr'), and then, for equal counts, by ascending word ('-k2'). This
output is piped to the sixth and final step:
sed ${1:-25}q
Here, the command 'sed' prints only the first n lines and then quits, where n
is the first argument given to the script (the default is 25).
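To watch the six steps at work, here is a small sketch that runs them one stage at a time and keeps each intermediate result in a variable. The sample sentence, the variable names and the helper 'first_n' are my own, for illustration only; the commands themselves are exactly those of the script.

```shell
# A made-up sample sentence, just for illustration
text="To be, or not to be: that is the question. Don't ask."

# Steps 1 and 2: one word per line, then everything in lowercase
words=$(printf '%s\n' "$text" | tr -cs A-Za-z\' '\n' | tr A-Z a-z)

# Steps 3 and 4: bring identical words together, then count each one
counts=$(printf '%s\n' "$words" | sort | uniq -c)

# Step 5: highest count first, ties broken by the word itself
ranked=$(printf '%s\n' "$counts" | sort -k1,1nr -k2)

# Step 6: ${1:-25} expands to the first argument, or to 25 if there is none.
# first_n is a hypothetical helper that wraps just this step.
first_n() { sed ${1:-25}q; }
top=$(printf '%s\n' "$ranked" | first_n 2)

printf '%s\n' "$top"
```

In this sample only 'be' and 'to' occur twice, so they are the two lines left in 'top'. Printing 'words', 'counts' and 'ranked' in turn shows what each stage hands to the next. Note that "don't" stays in one piece, because the apostrophe belongs to the set given to the first 'tr'.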
For many of you this may be quite household stuff, but I got excited and I
had to have my catharsis, and Indra cannot blame me, as I have already tagged
it with [OT].
dipankar das