At 6:58 PM -0500 3/18/00, [EMAIL PROTECTED] wrote:
>Looking at documentation, it does not appear that there is any option in
>either the conf file or the parameters passed to htsearch, to limit the
>number of matches which are located and sorted. If "several thousand"
>documents match the specified words, all of these have to participate in
>sorting; there's no way to limit the number which participate.
This has been requested in the past. The difficulty is a
chicken-and-egg one: you want to cut out documents before scoring and
sorting them (preferably before even looking them up in the document
DB), but until you have a ranking you don't know which ones are safe
to cut. After all, you don't want to cut out the best-ranked
documents!
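
To illustrate the trade-off, here is a minimal Python sketch. It is not
ht://Dig code: the function name, the document ids and the scores are all
made up for the example. It shows that a cap on results can avoid a full
sort, but every match still has to be scored before the cap can be applied.

```python
import heapq

def top_matches(scored, limit):
    """Return the `limit` highest-scoring (doc_id, score) pairs, best first.

    `scored` is a hypothetical list of (doc_id, score) pairs. Every match
    must already carry a score; only the full sort is avoided.
    """
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])

# Made-up matches for illustration only.
matches = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(top_matches(matches, 2))
```

A cut made before scoring, by contrast, has nothing to rank on and could
just as easily discard "b" as "a".
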
>Appears to me that I could inspect the .wordlist file produced by htdig,
>locate the records which are resulting in unwanted matches, and remove these
>prior to running htmerge.
Yes, you can do this. Another good technique is to use the cut and
sort command-line programs to count how often each word occurs and
add the overused ones to the bad_words list. One reason for doing this
is that very common words add very little information to a query.
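
The same frequency count can be sketched in Python. This is only a rough
illustration: it assumes each line of the word list starts with the word
itself, followed by whitespace and the remaining fields, which may not
match your version's file format exactly. The function names, the sample
lines and the threshold are made up.

```python
from collections import Counter

def count_words(lines):
    """Count how many word-list records exist for each word.

    Assumption: the word is the first whitespace-separated field on a line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts

def overused_words(lines, threshold):
    """Return words with at least `threshold` records, most frequent first."""
    counts = count_words(lines)
    return [word for word, n in counts.most_common() if n >= threshold]

# Made-up sample records; a real run would read the lines from the file.
sample = [
    "the i:1 w:2",
    "the i:2 w:1",
    "the i:3 w:1",
    "htdig i:1 w:5",
    "search i:2 w:1",
    "search i:3 w:1",
]
print(overused_words(sample, 2))
```

The words it reports are candidates only; review them by hand before
adding any to the bad_words list, since a frequent word may still matter
for your site's searches.
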
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/