According to Tomas Frydrych ([EMAIL PROTECTED]):
> Version: 3.1.5
>
> I need to add '+' to the list of valid word characters; after doing so htdig
> will index all words that contain '+' inside, but refuses to index words that
> start with '+' (and I suspect also words that end with it).
OK, I was able to reproduce the problem after all. I had limited my
earlier tests to htdig, but the problem is in htmerge. It gives special
meaning to lines in the db.wordlist file that begin with "+", "-" or
"!", which mark document IDs that are unchanged, discarded or superseded.
The trouble is that htmerge reads the wordlist assuming a valid word would
never begin with one of these, so its test for them is too liberal: any
line that starts with one of the three is taken as a marker, even when
it is really a word record.
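Here's a patch to correct the problem. htmerge now splits each line into
its tab-separated fields first, and only looks for a leading "+", "-" or
"!" when the line has no second field. With it, you can add any of these
three special characters to extra_word_characters and index words that
begin with one of them. Apply it in the htdig-3.1.5 main source directory
using "patch -p0 < this-message-file".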
--- htmerge/words.cc.wordbug Thu Feb 24 20:29:11 2000
+++ htmerge/words.cc Fri Nov 24 09:54:27 2000
@@ -74,37 +74,40 @@ mergeWords(char *wordtmp, char *wordfile
//
while (fgets(buffer, sizeof(buffer), sorted))
{
- if (*buffer == '+')
+ //
+ // Split the line up into the word, count, location, and
+ // document id.
+ //
+ word = good_strtok(buffer, '\t');
+ pair = good_strtok(NULL, '\t');
+ if (!word || !*word || !pair || !*pair)
{
+ if (*buffer == '+')
+ {
//
// This tells us that the document hasn't changed and we
// are to reuse the old words
//
- }
- else if (*buffer == '-')
- {
+ }
+ else if (*buffer == '-')
+ {
if (removeBadUrls)
{
discard_list.Add(strtok(buffer + 1, "\n"), 0);
if (verbose)
cout << "htmerge: Removing doc #" << buffer + 1 << endl;
}
- }
- else if (*buffer == '!')
- {
+ }
+ else if (*buffer == '!')
+ {
discard_list.Add(strtok(buffer + 1, "\n"), 0);
if (verbose)
cout << "htmerge: doc #" << buffer + 1 <<
" has been superceeded." << endl;
+ }
}
else
{
- //
- // Split the line up into the word, count, location, and
- // document id.
- //
- word = good_strtok(buffer, '\t');
- pair = good_strtok(NULL, '\t');
wr.Clear(); // Reset count to 1, anchor to 0, and all that
sid = "-";
while (pair && *pair)
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930