According to Tomas Frydrych ([EMAIL PROTECTED]):
> Version: 3.1.5
> 
> I need to add '+' to the list of valid word characters; after doing so htdig
> will index all words that contain '+' inside, but refuses to index words that
> start with '+' (and I suspect also words that end with it).

OK, I was able to reproduce the problem after all.  I had limited my
earlier tests to htdig only, but the problem is in htmerge.  It gives
special meaning to lines in the db.wordlist file that begin with "+", "-",
or "!", which mark document IDs that are unchanged, discarded or
superseded.  The trouble is that htmerge reads the wordlist assuming a
valid word would never begin with one of these characters, so its test
for them is too liberal.  Here's a patch to correct the problem.  It first
tries to split each line into its tab-separated word fields, and treats
the line as a marker only when that split fails.  With it, you can add any
of these three special characters to extra_word_characters and allow words
that begin with one of them.  Apply it in the htdig-3.1.5 main source
directory using "patch -p0 < this-message-file".

--- htmerge/words.cc.wordbug    Thu Feb 24 20:29:11 2000
+++ htmerge/words.cc    Fri Nov 24 09:54:27 2000
@@ -74,37 +74,40 @@ mergeWords(char *wordtmp, char *wordfile
     //
     while (fgets(buffer, sizeof(buffer), sorted))
     {
-       if (*buffer == '+')
+       //
+       // Split the line up into the word, count, location, and
+       // document id.
+       //
+       word = good_strtok(buffer, '\t');
+       pair = good_strtok(NULL, '\t');
+       if (!word || !*word || !pair || !*pair)
        {
+         if (*buffer == '+')
+         {
            //
            // This tells us that the document hasn't changed and we
            // are to reuse the old words
            //
-       }
-       else if (*buffer == '-')
-       {
+         }
+         else if (*buffer == '-')
+         {
            if (removeBadUrls)
            {
                discard_list.Add(strtok(buffer + 1, "\n"), 0);
                if (verbose)
                    cout << "htmerge: Removing doc #" << buffer + 1 << endl;
            }
-       }
-       else if (*buffer == '!')
-       {
+         }
+         else if (*buffer == '!')
+         {
            discard_list.Add(strtok(buffer + 1, "\n"), 0);
            if (verbose)
                cout << "htmerge: doc #" << buffer + 1 <<
                    " has been superceeded." << endl;
+         }
        }
        else
        {
-           //
-           // Split the line up into the word, count, location, and
-           // document id.
-           //
-           word = good_strtok(buffer, '\t');
-           pair = good_strtok(NULL, '\t');
            wr.Clear();   // Reset count to 1, anchor to 0, and all that
            sid = "-";
            while (pair && *pair)
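
For anyone who wants to see the new test in isolation, here is a rough
standalone sketch of the decision the patched loop makes.  It is not htdig
code: LineKind, classifyLine and checkExamples are names made up for this
illustration, and the tab handling only approximates what good_strtok()
does in the real source.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration only; none of this is in htdig.
enum class LineKind { Word, Unchanged, Discarded, Superseded, Unknown };

// Classify one db.wordlist line the way the patched loop does:
// try the word/tab/data split first, and look at the leading
// '+', '-' or '!' only when that split fails.
LineKind classifyLine(const std::string &line)
{
    // Approximation of the two good_strtok() calls: a word record has
    // a non-empty word, a tab, and a non-empty field after the tab.
    std::string::size_type tab = line.find('\t');
    bool hasWord = (tab != std::string::npos && tab > 0);
    bool hasPair = (hasWord && tab + 1 < line.size()
                    && line[tab + 1] != '\t');
    if (hasWord && hasPair)
        return LineKind::Word;      // even if the word starts with '+'

    if (line.empty())
        return LineKind::Unknown;

    switch (line[0])
    {
        case '+': return LineKind::Unchanged;   // reuse the old words
        case '-': return LineKind::Discarded;   // document removed
        case '!': return LineKind::Superseded;  // replaced by a newer doc
        default:  return LineKind::Unknown;     // silently skipped
    }
}

// A few sample lines; the field values after the tab are invented.
void checkExamples()
{
    assert(classifyLine("+123\n") == LineKind::Unchanged);
    assert(classifyLine("+plus\t1\t2\n") == LineKind::Word);
    assert(classifyLine("c++\t1\t2\n") == LineKind::Word);
}
```

The point of doing the split first is that a marker line carries only a
document ID and no tab, so a line with real word fields can never be
mistaken for one, whatever character the word starts with.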


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
