i don't think it would be that bad.

in 1996, in my former life as an IBMer, we put up 30 million documents
plus the OpenText web index on the web. we were using OpenText's
"pat" engine (aka "cpl") and some other full-text search software to
run the queries. if you haven't heard of these, don't feel bad. they're
pretty unremarkable and no longer exist. performance absolutely bit.
partially because the interface between the driver and the engine was
xml. (go figure. tim bray was the big man at opentext.) but the main
reason was that the index was set up for grep-like searches, doing the
matching directly against the (patricia-tree'd) text.

i've come to think that's backwards. i think you should scan the corpus
for a list of unique stemmed terms (run, running, Run, Run! all
considered the same term) and assign each term an index number. each
document can then be represented as a string of unique index numbers,
which can be indexed using normal techniques. a search would first
convert the query terms to index numbers and then do the search.
regular expressions could be applied to the /term/ list*, not the corpus.
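
a rough sketch of the term-numbering idea, in python just for brevity.
the tokenizer, the names, and the do-nothing stem() below are all mine,
not anything that exists; a real stem() would use something like porter
stemming so run/running/runs actually collapse.

import re

term_ids = {}          # stemmed term -> index number
doc_terms = {}         # document name -> list of index numbers

def stem(word):
    # placeholder: just fold case.  a real stemmer would also
    # collapse run/running/runs into one term.
    return word.lower()

def index_document(name, text):
    ids = []
    for word in re.findall(r"[a-zA-Z0-9]+", text):
        t = stem(word)
        if t not in term_ids:
            term_ids[t] = len(term_ids) + 1
        ids.append(term_ids[t])
    doc_terms[name] = ids

def search(query):
    # convert the query terms to index numbers first, then match
    # against the documents' term-number lists, not the raw text.
    wanted = {term_ids.get(stem(w)) for w in query.split()}
    if None in wanted:
        return []                  # some term never occurs anywhere
    return [name for name, ids in doc_terms.items()
            if wanted <= set(ids)]

index_document("mail/1", "Run! run the test again")
index_document("mail/2", "walk the dog")
print(search("run test"))      # ['mail/1']
print(search("the"))           # ['mail/1', 'mail/2']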

you could prototype this with 3 tables (term_tab, doc_term_xref, doc_tab) 
from almost any databaseish thing that allows concurrent updates and queries.
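
for example, with sqlite standing in for the databaseish thing (only to
show the shape of the thing, not a claim about its concurrency; the
column names beyond the three table names above are just my guesses):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table term_tab      (term_id integer primary key, term text unique);
    create table doc_tab       (doc_id  integer primary key, name text unique);
    create table doc_term_xref (doc_id  integer, term_id integer);
    create index xref_term on doc_term_xref(term_id);
""")

def term_id(term):
    # look up (or assign) the index number for a stemmed term
    db.execute("insert or ignore into term_tab(term) values (?)", (term,))
    return db.execute("select term_id from term_tab where term = ?",
                      (term,)).fetchone()[0]

def add_doc(name, terms):
    cur = db.execute("insert into doc_tab(name) values (?)", (name,))
    doc = cur.lastrowid
    db.executemany("insert into doc_term_xref values (?, ?)",
                   [(doc, term_id(t)) for t in terms])

def search(term):
    # a search converts the term to its index number and then
    # joins through the cross-reference table.
    return [r[0] for r in db.execute("""
        select d.name from doc_tab d
        join doc_term_xref x on x.doc_id = d.doc_id
        join term_tab t on t.term_id = x.term_id
        where t.term = ?""", (term,))]

add_doc("mail/1", ["run", "fast", "index"])
add_doc("mail/2", ["walk", "index"])
print(search("index"))         # ['mail/1', 'mail/2']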

obviously there's some generality lost (proximity searches,
whitespace/newline matching), but i think this would get you 80% of
what you would want at 20% of the complexity.

so many things to program, so little time.

- erik "mr vaporware" quanstrom

* my quick-and-dirty term check of my own email archive gives ~33000 terms.
(this is a big overcount.)


; cat */* | \
        tr 'A-Z' 'a-z' | \
        sed 's/[][;:"[EMAIL PROTECTED]&*()+=|\\/?<>,.]/ /g' | \
        grep -v '[^a-z0-9]$' | \
        awk '
{
        for (i=1; i<=NF;i++) {
                l = length($i);
                if (l>1 && l<15)
                        A[$i]++
        } 
}

END {
        n=0; 
        for(i in A) {
                n++; 
                printf("%d %s\n", A[i], i);
        }
        print n
}' |wc -l
33558

[EMAIL PROTECTED] writes

| i thought i'd write an external search algorithm - i'm most of the way
| through an extendable hash implementation (which seems simple and
| quick for insertion, but things get more complex when dealing with
| large values, and on deletion; i'm not sure of the best way to deal
| with block allocation; and more seriously, maybe it's essential to
| have an algorithm that can do range (e.g.  prefix) lookups).  any
| elegant (read *small*!), nicely implemented, open source libraries out
| there that might fit the bill?  a good description of an appropriate
| algorithm would do just as well...
