Hi all,
In preparation for creating the index for my book, I wrote a Ruby program to
list every word in a file (in this case the .lyx file).
Of course, a plain list of unique words could be produced with a simple
one-liner using sed and sort -u, but my program lists the words in two
different orders: first in alphabetical order, which the one-liner could do,
and then in descending order of occurrence, which sort -u alone can't do
because it discards the counts.
The order of occurrence is very handy because the most frequently occurring
words are usually garbage like "the", "he", "a" and the like. You can
therefore scan and delete those words very quickly.
The words used only once or twice comprise the majority of words, and because
they're used only once or twice, they're typically not important and you can
scan them very quickly.
The words in the middle typically include many that are useful in constructing
an index, and should be perused more carefully.
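The three-way split described above can be sketched in a few lines of Ruby.
This is only an illustration, not part of the program: the method name triage
and the cutoffs (50 or more occurrences treated as garbage, 2 or fewer as
rare) are assumptions, and the right numbers will depend on the size of the
book.

```ruby
# Sketch: sort a word => count hash into the three bands described above.
# The name "triage" and both cutoffs are hypothetical; tune them per book.
def triage(word_counts, high_cutoff = 50, low_cutoff = 2)
  bands = { :common => [], :candidates => [], :rare => [] }
  word_counts.each do |word, count|
    if count >= high_cutoff
      bands[:common] << word        # likely garbage: "the", "a", ...
    elsif count <= low_cutoff
      bands[:rare] << word          # used once or twice, quick to scan
    else
      bands[:candidates] << word    # the middle band, worth a careful look
    end
  end
  bands.each_value { |words| words.sort! }
  bands
end

# Made-up counts, purely for demonstration
counts = { "the" => 120, "a" => 95, "index" => 12, "glossary" => 7, "zebra" => 1 }
bands = triage(counts)
puts "Likely index candidates: #{bands[:candidates].join(', ')}"
```

Only the middle band is printed here, since that is the list that deserves the
slow read.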
My program is licensed under the GNU GPL version 2, and is included as the
remainder of the body of this document. Have fun with it.
SteveT
#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2
require 'set'
$punct=Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[", "]",
"{", "}", "|", "`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_",
"+"])
# Sort comparator for [word, count] pairs:
# descending by count, then ascending by word.
def by_freq_then_name(a, b)
  [b[1], a[0]] <=> [a[1], b[0]]
end
# Maps each word to the number of times it occurs
word_hash = Hash.new()
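# Read the text from standard input, one line at a time
STDIN.each do |line|
  line.strip!
  line.split(/\s+/).each do |word|
    # Trim punctuation from the front of the word
    while word.length > 0 and $punct.include?(word[0].chr)
      word = word[1..-1]
    end
    # Trim punctuation from the end of the word
    while word.length > 0 and $punct.include?(word[-1].chr)
      word = word[0..-2]
    end
    # Skip tokens that consisted entirely of punctuation
    next if word.empty?
    if word_hash.has_key?(word)
      word_hash[word] += 1
    else
      word_hash[word] = 1
    end
  end
end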
puts "================================================="
puts "=============== ALPHA ORDER ====================="
puts "================================================="
keys = word_hash.keys.sort
keys.each do |key|
  printf "%24s %6d\n", key, word_hash[key]
end
puts "================================================="
puts "============ OCCURRENCE ORDER ==================="
puts "================================================="
temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do |word_freq|
  printf "%7d %s\n", word_freq[1], word_freq[0]
end