Handy word list program for indexing

Steve Litt Mon, 05 Mar 2007 15:27:03 -0800

Hi all,

In preparation for creating the index for my book, I wrote a Ruby program to 
list every word in a file (in this case the .lyx file).


Now of course this could be done with a simple one-liner using sed and 
sort -u, but my program lists the words in two different orders: first in 
alphabetical order, which the one-liner could also do, and then in descending 
order of occurrence, which sort -u alone can't do.

The order of occurrence is very handy because the most frequently occurring 
words are usually garbage like the, he, a and the like. Therefore, you can 
scan and delete those words very quickly.

The words used only once or twice comprise the majority of words, and because 
they're used only once or twice, they're typically not important and you can 
scan them very quickly.

The words in the middle typically include many that are useful in constructing 
an index, and should be perused more carefully.
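
The three-way split described above can be sketched roughly as follows. This 
is only an illustration, not part of the program below: the triage method, the 
sample counts and both cutoffs (2 and 50) are assumptions made up for the 
example, and sensible cutoffs will depend on the size of the book.

```ruby
# Hypothetical helper: split word counts into three bands for index triage.
# rare_max and common_min are illustrative assumptions, not tuned values.
def triage(counts, rare_max = 2, common_min = 50)
  bands = { :common => [], :middle => [], :rare => [] }
  counts.each do |word, n|
    band = if n >= common_min
             :common      # mostly garbage like "the"; delete quickly
           elsif n <= rare_max
             :rare        # used once or twice; scan quickly
           else
             :middle      # likely index candidates; read carefully
           end
    bands[band] << word
  end
  bands.each_value { |words| words.sort! }
  bands
end

# Made-up counts, purely for demonstration.
sample = { "the" => 120, "a" => 95, "outline" => 14, "index" => 9, "zebra" => 1 }
bands = triage(sample)
puts "Review carefully: #{bands[:middle].join(', ')}"
```

With cutoffs like these, only the middle band needs a slow, careful read.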

My program, which is written in Ruby, is licensed under the GNU GPL version 2, 
and is included as the remainder of the body of this document. Have fun with it.

SteveT



#!/usr/bin/ruby
# Copyright (C) 2007 by Steve Litt, all rights reserved
# This program is licensed under the GNU GPL version 2 -- only version 2

require 'set'

$punct=Set.new([",", ".", "/", "<", ">", "?", ";", "'", ":", '"', "[", "]", 
"{", "}", "|", "`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", 
"+"])

# Comparator for [word, count] pairs: highest count first, ties broken by word
def by_freq_then_name(a, b)
        if a[1] < b[1]
                return 1
        elsif a[1] > b[1]
                return -1
        elsif a[0] > b[0]
                return 1
        elsif a[0] < b[0]
                return -1
        else
                return 0
        end
end


# Maps each word to the number of times it occurs in the input
word_hash = Hash.new()
STDIN.each do
        |line|
        line.chomp!
        line.strip!
        temparr = line.split(/\s\s*/)
        temparr.each do
                |word|
                while word.length > 0 and $punct.include?(word[0].chr)
                        word = word[1..-1]
                end
                while word.length > 0 and $punct.include?(word[-1].chr) 
                        word = word[0..-2]
                end

                # Skip tokens that were pure punctuation and are now empty
                next if word.empty?
                if word_hash.has_key?(word)
                        word_hash[word] += 1
                else
                        word_hash[word] = 1
                end
        end
end

puts "================================================="
puts "=============== ALPHA ORDER ====================="
puts "================================================="

keys = word_hash.keys.sort
keys.each do
        |key|
        printf "%24s %6d\n", key, word_hash[key]
end

puts "================================================="
puts "============ OCCURRENCE ORDER ==================="
puts "================================================="

temparray = word_hash.sort{|a,b| by_freq_then_name(a, b)}
temparray.each do
        |word_freq|
        printf "%7d   %s\n", word_freq[1], word_freq[0]
end
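
For what it's worth, the same occurrence ordering can probably be had more 
compactly with sort_by. The following is a minimal sketch under that 
assumption, not a replacement for the comparator above; the sample hash is 
made-up data standing in for word_hash.

```ruby
# Made-up sample counts, standing in for word_hash from the program above.
sample = { "index" => 3, "the" => 40, "book" => 3, "a" => 40, "lyx" => 1 }

# Negating the count puts high counts first; the word itself breaks ties,
# so the result is descending by count, then ascending alphabetically.
by_freq = sample.sort_by { |word, count| [-count, word] }

by_freq.each do |word, count|
  printf "%7d   %s\n", count, word
end
```

Array keys compare element by element, which is why one sort_by block can 
express both sort criteria.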
