Hi Geoff and ht://Diggers!

On Mon, 26 Feb 2001, Geoff Hutchison wrote:
> On Mon, 26 Feb 2001, Gilles Detillieux wrote:
>
> > > list".  That is - a heavier weight assigned to results found on the local
> > > server.  Is there a way to accomplish this?
> >
> > If I'm not mistaken, there's a configuration attribute in 3.2.0b3 that
> > allows you to tweak scores based on the URL, so this is probably what you
> > want here.
>
> Yes, it's called url_seed_score:
> <http://www.htdig.org/dev/htdig-3.2/attrs.html#url_seed_score>

Experiments showed this feature was imperfect; it was hard to tune to get
the desired result order.  A better approach was to specify a list of
patterns (substrings, simple "htdig regexes" matched against the URL) in
the desired order, controlled by a new attribute called
search_results_order.  See the attribute documentation (in the second patch).
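
To illustrate the ordering rule, here is a minimal standalone sketch.  It
is not the code from the patch, which uses a SplitMatches class built on
htdig's own List and StringMatch types.  Hit, matchesPattern, groupOf and
orderResults are hypothetical names, and the sketch assumes that a URL
belongs to the first pattern that matches it:

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result record; the real patch works on ResultMatch objects.
struct Hit {
    std::string url;
    double score;
};

// True if any '|'-separated substring of `pattern` occurs in `url`.
static bool matchesPattern(const std::string& pattern, const std::string& url)
{
    std::istringstream parts(pattern);
    std::string sub;
    while (std::getline(parts, sub, '|'))
        if (!sub.empty() && url.find(sub) != std::string::npos)
            return true;
    return false;
}

// Index of the group a URL falls into: the first matching pattern, else
// the position of "*".  If no "*" is listed, it is treated as if it were
// appended at the end of the list.
static std::size_t groupOf(const std::vector<std::string>& order,
                           const std::string& url)
{
    std::size_t star = order.size();
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (order[i] == "*") { star = i; continue; }
        if (matchesPattern(order[i], url)) return i;
    }
    return star;
}

// Primary key: position in the configured order.  Secondary key: score,
// highest first.
std::vector<Hit> orderResults(std::vector<Hit> hits,
                              const std::vector<std::string>& order)
{
    std::stable_sort(hits.begin(), hits.end(),
        [&](const Hit& a, const Hit& b) {
            std::size_t ga = groupOf(order, a.url);
            std::size_t gb = groupOf(order, b.url);
            if (ga != gb) return ga < gb;
            return a.score > b.score;
        });
    return hits;
}
```

With the order "/docs/|faq.html * /maillist/ /testresults/", a low-scoring
hit under /docs/ is listed before a high-scoring hit under /maillist/,
which is the behaviour that was hard to get from score tweaking alone.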

> It was contributed by Hans-Peter Nilsson, and IIRC there was also a patch
> to the 3.1.x code as well, though I can't seem to find it (or his original
> messages) now. Hans-Peter, do you still have the patch?

Attached.

Also attached is the search_results_order patch, which I believe has not
been imported into the current sources, though I haven't checked - and I've
been away from htdig for a *long* time.  Note that the patch is for 3.1.4,
and assumes the url_seed_score patch is applied, IIRC.  Perhaps Some Day
Someone will port it to current sources.  Unless it's obsoleted by some
other new feature, of course. :-)

ChangeLog for url_seed_score:

Thu Jan  6 10:20:15 2000  Hans-Peter Nilsson  <[EMAIL PROTECTED]>

        * htdoc/attrs.html (url_seed_score): New.
        * htdoc/cf_byname.html: Added url_seed_score.
        * htdoc/cf_byprog.html: Ditto.

        * htcommon/defaults.cc (defaults): Add default for url_seed_score.

        * htlib/HtURLSeedScore.cc, HtURLSeedScore.h: New.

        * htsearch/Display.h (class Display): Add member minScore.
        Change maxScore type to double.

        * htsearch/Display.cc: Include math.h and HtURLSeedScore.h.
        (Display constructor): Initialize minScore, change init value for
        maxScore to -DBL_MAX.
        (displayMatch): Use minScore in calculation of score to adjust for
        negative scores.
        (buildMatchList): Use an URLSeedScore to adjust the score after
        other calculations.
        Calculate minScore.
        Correct maxScore adjustment for change to double.
        (sort): Calculation of maxScore moved to buildMatchList.
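
As a rough illustration of what these changes amount to, here is a
standalone sketch (not the patch code).  SeedRule, adjustScore and
percentOf are hypothetical names, the rules are assumed to be parsed
already, and plain substring search stands in for StringMatch.  The guard
for an empty score range is an addition of this sketch, not something the
patch contains:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One "pattern formula" pair, already parsed:
// newscore = oldscore * mul + add.
struct SeedRule {
    std::string substring;  // plain substring; the patch uses StringMatch
    double mul;
    double add;
};

// Apply the first rule whose substring occurs in the URL; otherwise
// return the score unchanged.
double adjustScore(const std::vector<SeedRule>& rules, double score,
                   const std::string& url)
{
    for (const SeedRule& r : rules)
        if (url.find(r.substring) != std::string::npos)
            return score * r.mul + r.add;
    return score;
}

// Map a score to 1..100 relative to the [minScore, maxScore] range, so
// that negative adjusted scores still give a sensible percentage.
int percentOf(double score, double minScore, double maxScore)
{
    // Assumption of this sketch: treat an empty range as a full match
    // instead of dividing by zero.
    if (maxScore <= minScore)
        return 100;
    int percent = static_cast<int>((score - minScore) * 100 /
                                   (maxScore - minScore));
    return std::max(percent, 1);
}
```

Because an additive constant such as -1e6 can push scores below zero, the
percentage has to be computed against the lowest score rather than against
zero, which is why minScore is tracked alongside maxScore.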

ChangeLog for search_results_order:

Sun Jan 30 12:29:13 2000  Hans-Peter Nilsson  <[EMAIL PROTECTED]>

        * htdoc/attrs.html (search_results_order): New.
        * htdoc/cf_byname.html: Added search_results_order.
        * htdoc/cf_byprog.html: Ditto.

        * htcommon/defaults.cc (defaults): Add default for
        search_results_order.

        * htlib/List.h (List): New method AppendList.
        * htlib/List.cc (List::AppendList): Implement it.

        * htsearch/SplitMatches.h, htsearch/SplitMatches.cc: New.

        * htsearch/Display.cc: Include SplitMatches.h.
        (buildMatchList): Use a SplitMatches to hold search results and
        iterate over its parts when sorting scores.
        Ignore Count() of matches when setting minScore and maxScore.
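
For readers who do not know htdig's List class: the new AppendList method
moves the contents of one list to the end of another and leaves the source
list empty.  A rough standard-library analogy, assuming std::list in place
of htdig's List (appendList is a hypothetical name), would be:

```cpp
#include <list>
#include <string>

// Move every element of `other` to the end of `dest`, leaving `other`
// empty.  Appending a list to itself is a no-op, as in the patch.
void appendList(std::list<std::string>& dest, std::list<std::string>& other)
{
    if (&dest == &other)
        return;
    dest.splice(dest.end(), other);
}
```

Like the patched method, splice relinks the existing nodes and copies no
elements, so joining the sorted parts of the result list stays cheap.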

brgds, H-P
PS.  Both patches should also be in the archived lists, perhaps with more
background.  Search for seed_score and search_results_order.
*** /dev/null   Tue Jan  1 05:00:00 1980
--- htlib/HtURLSeedScore.h      Fri Jan  7 09:13:08 2000
***************
*** 0 ****
--- 1,55 ----
+ //
+ // HtURLSeedScore.h
+ //
+ // URLSeedScore:  Constructed from a Configuration, see doc
+ // for format of config item "url_seed_score".
+ //  Method "double adjust_score(double score, const String &url)"
+ // returns an adjusted score, given the original score, or returns the
+ // original score if there was no adjustment to do.
+ //
+ // $Id$
+ //
+ // Part of the ht://Dig package   <http://www.htdig.org/>
+ // Copyright (c) 2000 The ht://Dig Group
+ // For copyright details, see the file COPYING in your distribution
+ // or the GNU Public License version 2 or later
+ // <http://www.gnu.org/copyleft/gpl.html>
+ //
+ #ifndef __HtURLSeedScore_h
+ #define __HtURLSeedScore_h
+ 
+ #include "Configuration.h"
+ #include "List.h"
+ 
+ class URLSeedScore
+ {
+ public:
+     URLSeedScore(Configuration &);
+     ~URLSeedScore();
+ 
+     // Return the "adjusted" score.  Use an inline method to avoid
+     // function-call overhead when this feature is unused.
+     double adjust_score(double score, const String& url)
+     {
+       return myAdjustmentList->Count() == 0
+           ? score : noninline_adjust_score(score, url);
+     }
+ 
+     // If an error was discovered during the parsing of
+     // the configuration, this member gives a
+     // nonempty String with an error message.
+     const String& ErrMsg() { return myErrMsg; }
+ 
+ private:
+     double noninline_adjust_score(double score, const String& url);
+ 
+     // These member functions are not supposed to be implemented.
+     URLSeedScore();
+     URLSeedScore(const URLSeedScore &);
+     void operator= (const URLSeedScore &);
+ 
+     List *myAdjustmentList;
+     String myErrMsg;
+ };
+ 
+ #endif /* __HtURLSeedScore_h */
*** /dev/null   Tue Jan  1 05:00:00 1980
--- htlib/HtURLSeedScore.cc     Fri Jan  7 09:13:08 2000
***************
*** 0 ****
--- 1,214 ----
+ //
+ // HtURLSeedScore.cc
+ //
+ // URLSeedScore:
+ //    Holds a list of configured adjustments to be applied on a given
+ //    score and given URL.
+ //
+ // Part of the ht://Dig package   <http://www.htdig.org/>
+ // Copyright (c) 2000 The ht://Dig Group
+ // For copyright details, see the file COPYING in your distribution
+ // or the GNU Public License version 2 or later
+ // <http://www.gnu.org/copyleft/gpl.html>
+ //
+ // $Id$
+ 
+ #include "StringList.h"
+ #include "StringMatch.h"
+ #include "HtURLSeedScore.h"
+ #include <stdio.h>
+ #include <ctype.h>
+ 
+ // This class is only used in private members of URLSeedScore.
+ // The OO-right thing would be to nest this inside the private
+ // declaration of URLSeedScore, but that would cause portability
+ // problems according to
+ // <URL:http://www.mozilla.org/hacking/portable-cpp.html#inner_classes>.
+ 
+ class ScoreAdjustItem : public Object
+ {
+ public:
+     // Construct from a string applicable to StringMatch, and a string to
+     // parse for a formula.
+     ScoreAdjustItem(String &, String &);
+ 
+     ~ScoreAdjustItem();
+ 
+     // Does this item match?
+     inline bool Match(const String &s) { return match.FindFirst(s.get()) != -1; }
+ 
+     // Return the argument adjusted according to this item.
+     double adjust_score(double orig)
+     { return orig*my_mul_factor + my_add_constant; }
+ 
+     // Error in parsing?  Message given here if non-empty string.
+     String& ErrMsg() { return myErrMsg; }
+ 
+ private:
+     double my_add_constant;
+     double my_mul_factor;
+     StringMatch match;
+ 
+     static String myErrMsg;
+ 
+     // These member functions are not supposed to be implemented, but
+     // mentioned here as private so the compiler will not generate them if
+     // someone puts in buggy code that would use them.
+     ScoreAdjustItem();
+     ScoreAdjustItem(const ScoreAdjustItem &);
+     void operator= (const ScoreAdjustItem &);
+ };
+ 
+ // Definition of myErrMsg.
+ String ScoreAdjustItem::myErrMsg("");
+ 
+ ScoreAdjustItem::ScoreAdjustItem(String &url_regex, String &formula)
+ {
+     double mul_factor = 1;
+     double add_constant = 0;
+     bool factor_found = false;
+     bool constant_found = false;
+     int chars_so_far;
+     match.Pattern(url_regex);
+ 
+     // FIXME: Missing method to check if the regex was in error.
+     // We'll check hasPattern for the time being as a placeholder.
+     if (! match.hasPattern())
+     {
+       myErrMsg = form("%s is not a valid regex", url_regex.get());
+       return;
+     }
+ 
+     char *s = formula.get();
+ 
+     // Parse the ([*]N[ ]*)?[+]?M format.
+     if (s[0] == '*')
+     {
+       // Skip past the '*'.
+       s++;
+ 
+       // There is a mul_factor.  Let's parse it.
+       chars_so_far = 0;
+       sscanf(s, "%lf%n", &mul_factor, &chars_so_far);
+ 
+       // If '%lf' failed to match, then it will show up as either no
+       // assignment to chars_so_far, or as writing 0 there.
+       if (chars_so_far == 0)
+       {
+           myErrMsg = form("%s is not a valid adjustment formula", s);
+           return;
+       }
+ 
+       // Skip past the number.
+       s += chars_so_far;
+ 
+       // Skip any whitespace.
+       while (isspace((unsigned char) *s))
+           s++;
+ 
+       // Eat any plus-sign; it's redundant if alone, and may come before a
+       // minus.
+       if (*s == '+')
+           s++;
+ 
+       factor_found = true;
+     }
+ 
+     // If there's anything here, it must be the additive constant.
+     if (*s)
+     {
+       chars_so_far = 0;
+       sscanf(s, "%lf%n", &add_constant, &chars_so_far);
+ 
+       // If '%lf' failed to match, then it will show up as either no
+       // assignment to chars_so_far, or as writing 0 there.
+       //  We also need to check that it was the end of the input.
+       if (chars_so_far == 0 || s[chars_so_far] != 0)
+       {
+           myErrMsg = form("%s is not a valid adjustment formula",
+                           formula.get());
+           return;
+       }
+ 
+       constant_found = true;
+     }
+ 
+     // Either part must be there.
+     if (!factor_found && !constant_found)
+     {
+       myErrMsg = form("%s is not a valid formula", formula.get());
+       return;
+     }
+ 
+     my_add_constant = add_constant;
+     my_mul_factor = mul_factor;
+ }
+ 
+ ScoreAdjustItem::~ScoreAdjustItem()
+ {
+ }
+ 
+ URLSeedScore::URLSeedScore(Configuration &config)
+ {
+     char *config_item = "url_seed_score";
+ 
+     StringList sl(config[config_item], "\t \r\n");
+ 
+     myAdjustmentList = new List();
+ 
+     if (sl.Count() % 2)
+     {
+       myErrMsg = form("%s is not a list of pairs (odd number of items)",
+                       config_item);
+ 
+       // We *could* continue, but that just means the error will be harder
+       // to find, unless someone actually sees the error message.
+       return;
+     }
+ 
+     // Parse each as in TemplateList::createFromString.
+     for (int i = 0; i < sl.Count(); i += 2)
+     {
+       String url_regex = sl[i];
+       String adjust_formula = sl[i+1];
+ 
+       ScoreAdjustItem *adjust_item
+           = new ScoreAdjustItem(url_regex, adjust_formula);
+ 
+       if (adjust_item->ErrMsg().length() != 0)
+       {
+           // No point in continuing beyond the error; we might just
+           // overwrite the first error.
+           myErrMsg = form("While parsing %s: %s",
+                           config_item,
+                           adjust_item->ErrMsg().get());
+           return;
+       }
+ 
+       myAdjustmentList->Add(adjust_item);
+     }
+ }
+ 
+ URLSeedScore::~URLSeedScore()
+ {
+     delete myAdjustmentList;
+ }
+ 
+ double
+ URLSeedScore::noninline_adjust_score(double orig_score, const String &url)
+ {
+     List *adjlist = myAdjustmentList;
+     ScoreAdjustItem *adjust_item;
+ 
+     adjlist->Start_Get();
+ 
+     while ((adjust_item = (ScoreAdjustItem *) adjlist->Get_Next()))
+     {
+       // Use the first match only.
+       if (adjust_item->Match(url))
+           return adjust_item->adjust_score(orig_score);
+     }
+ 
+     // We'll get here if no match was found.
+     return orig_score;
+ }
Index: htcommon/defaults.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htcommon/defaults.cc,v
retrieving revision 1.43.2.12
diff -p -c -r1.43.2.12 defaults.cc
*** htcommon/defaults.cc        1999/12/06 22:26:46     1.43.2.12
--- htcommon/defaults.cc        2000/01/07 09:32:39
*************** ConfigDefaults  defaults[] =
*** 148,153 ****
--- 148,154 ----
      {"translate_amp",                   "false"},
      {"translate_lt_gt",                 "false"},
      {"translate_quot",                  "false"},
+     {"url_seed_score",                        ""},
      {"url_list",                      "${database_base}.urls"},
      {"url_part_aliases",                ""},
      {"url_log",                               "${database_base}.log"},
Index: htdoc/attrs.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/attrs.html,v
retrieving revision 1.27.2.25
diff -p -c -r1.27.2.25 attrs.html
*** htdoc/attrs.html    1999/12/07 04:29:26     1.27.2.25
--- htdoc/attrs.html    2000/01/07 09:32:50
***************
*** 6816,6821 ****
--- 6816,6895 ----
        <hr>
        <dl>
          <dt>
+               <strong><a name="url_seed_score">url_seed_score</a></strong>
+         </dt>
+         <dd>
+               <dl>
+                 <dt>
+                       <em>type:</em>
+                 </dt>
+                 <dd>
+                       string list
+                 </dd>
+                 <dt>
+                       <em>used by:</em>
+                 </dt>
+                 <dd>
+                       <a href="htsearch.html">htsearch</a>
+                 </dd>
+                 <dt>
+                       <em>default:</em>
+                 </dt>
+                 <dd>
+                       <em>&lt;empty&gt;</em>
+                 </dd>
+                 <dt>
+                       <em>description:</em>
+                 </dt>
+                 <dd>
+                       This is a list of pairs, <em>pattern</em>
+                       <em>formula</em>, used to weigh the score of
+                       hits, depending on the URL of the document.<br>
+                       The <em>pattern</em> part is a substring to match
+                       against the URL.  Pipe ('|') characters can be
+                       used in the pattern to concatenate substrings for
+                       web-areas that have the same formula.<br>
+                       The formula describes a <em>factor</em> and a
+                       <em>constant</em>, by which the hit score is
+                       weighed.  The original score is multiplied by
+                       the <em>factor</em> part, then the <em>constant</em>
+                       part is added.<br>
+                       The format of the formula is the factor part:
+                       "*<em>N</em>" optionally followed by
+                       spaces, followed by the constant part:
+                       "+<em>M</em>", where the plus sign may be omitted
+                       for negative numbers.  Either part is optional,
+                       but must come in this order.<br>
+                       The numbers <em>N</em> and <em>M</em> are floating
+                       point constants.<br>
+                       More straightforward is to think of the format as
+                       "newscore = oldscore*<em>N</em>+<em>M</em>",
+                       but with the "newscore = oldscore" part left out.
+                 </dd>
+                 <dt>
+                       <em>example:</em>
+                 </dt>
+                 <dd>
+                       <table border="0">
+                         <tr>
+                               <td valign="top">
+                                 url_seed_score:
+                               </td>
+                               <td nowrap>
+                                 /mailinglist/ *.5-1e6 \<br>
+                                 /docs/|/news/ *1.5 \<br>
+                                 /testresults/ "*.7 -200" \<br>
+                                 /faq-area/ *2+10000
+                               </td>
+                         </tr>
+                       </table>
+                 </dd>
+               </dl>
+         </dd>
+       </dl>
+       <hr>
+       <dl>
+         <dt>
                <strong><a name="use_meta_description">
                use_meta_description</a></strong>
          </dt>
Index: htdoc/cf_byname.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/cf_byname.html,v
retrieving revision 1.18.2.13
diff -p -c -r1.18.2.13 cf_byname.html
*** htdoc/cf_byname.html        1999/12/06 22:26:48     1.18.2.13
--- htdoc/cf_byname.html        2000/01/07 09:32:51
***************
*** 176,181 ****
--- 176,182 ----
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_list">url_list</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_log">url_log</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_part_aliases">url_part_aliases</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_seed_score">url_seed_score</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_meta_description">use_meta_description</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_star_image">use_star_image</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#user_agent">user_agent</a><br>
Index: htdoc/cf_byprog.html
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdoc/cf_byprog.html,v
retrieving revision 1.17.2.13
diff -p -c -r1.17.2.13 cf_byprog.html
*** htdoc/cf_byprog.html        1999/12/06 22:26:48     1.17.2.13
--- htdoc/cf_byprog.html        2000/01/07 09:32:52
***************
*** 175,180 ****
--- 175,181 ----
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#syntax_error_file">syntax_error_file</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#uncoded_db_compatible">uncoded_db_compatible</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_part_aliases">url_part_aliases</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#url_seed_score">url_seed_score</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_meta_description">use_meta_description</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#use_star_image">use_star_image</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#valid_punctuation">valid_punctuation</a><br>
Index: htlib/Makefile.in
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htlib/Makefile.in,v
retrieving revision 1.13.2.2
diff -p -c -r1.13.2.2 Makefile.in
*** htlib/Makefile.in   1999/03/29 15:53:48     1.13.2.2
--- htlib/Makefile.in   2000/01/07 09:32:52
*************** OBJS=   Configuration.o Connection.o Datab
*** 16,22 ****
                URL.o URLTrans.o cgi.o \
                good_strtok.o io.o strcasecmp.o \
                strptime.o mytimegm.o HtCodec.o HtWordCodec.o \
!               HtURLCodec.o regex.o HtWordType.o
  
  TARGET=               libht.a
  
--- 16,22 ----
                URL.o URLTrans.o cgi.o \
                good_strtok.o io.o strcasecmp.o \
                strptime.o mytimegm.o HtCodec.o HtWordCodec.o \
!               HtURLCodec.o regex.o HtWordType.o HtURLSeedScore.o
  
  TARGET=               libht.a
  
Index: htsearch/Display.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htsearch/Display.cc,v
retrieving revision 1.54.2.22
diff -p -c -r1.54.2.22 Display.cc
*** htsearch/Display.cc 1999/12/07 16:52:35     1.54.2.22
--- htsearch/Display.cc 2000/01/07 09:32:56
*************** static char RCSid[] = "$Id: Display.cc,v
*** 21,28 ****
--- 21,30 ----
  #include <ctype.h>
  #include <syslog.h>
  #include <locale.h>
+ #include <math.h>
  #include "HtURLCodec.h"
  #include "HtWordType.h"
+ #include "HtURLSeedScore.h"
  
  //*****************************************************************************
  //
*************** Display::Display(char *indexFile, char *
*** 43,49 ****
      templateError = 0;
  
      maxStars = config.Value("max_stars");
!     maxScore = 100;
      setupImages();
      setupTemplates();
  
--- 45,52 ----
      templateError = 0;
  
      maxStars = config.Value("max_stars");
!     maxScore = -DBL_MAX;
!     minScore = DBL_MAX;
      setupImages();
      setupTemplates();
  
*************** Display::displayMatch(ResultMatch *match
*** 304,310 ****
  
      if (maxScore != 0)
        {
!       int percent = (int)(ref->DocScore() * 100 / (double)maxScore);
        if (percent <= 0)
          percent = 1;
        vars.Add("PERCENT", new String(form("%d", percent)));
--- 307,314 ----
  
      if (maxScore != 0)
        {
!       int percent = (int)((ref->DocScore() - minScore) * 100 /
!                           (maxScore - minScore));
        if (percent <= 0)
          percent = 1;
        vars.Add("PERCENT", new String(form("%d", percent)));
*************** Display::generateStars(DocumentRef *ref,
*** 742,748 ****
  
      if (maxScore != 0)
      {
!       score = ref->DocScore() / (double)maxScore;
      }
      else
      {
--- 746,752 ----
  
      if (maxScore != 0)
      {
!       score = (ref->DocScore() - minScore) / (maxScore - minScore);
      }
      else
      {
*************** Display::buildMatchList()
*** 938,943 ****
--- 942,951 ----
      double      backlink_factor = config.Double("backlink_factor");
      double      date_factor = config.Double("date_factor");
      SortType  typ = sortType();
+     URLSeedScore adjustments(config);
+ 
+     // If we knew where to pass it, this would be a good place to pass
+     // on errors from adjustments.ErrMsg().
        
      results->Start_Get();
      while ((id = results->Get_Next()))
*************** Display::buildMatchList()
*** 1007,1012 ****
--- 1015,1023 ----
                        sortRef->DocTitle(thisRef->DocTitle());
                    thisMatch->setRef(sortRef);
                  }
+ 
+               score = adjustments.adjust_score(score, thisRef->DocURL());
+ 
              }
            // Get rid of it to free the memory!
            delete thisRef;
*************** Display::buildMatchList()
*** 1019,1024 ****
--- 1030,1039 ----
        // Append this match to our list of matches.
        //
        matches->Add(thisMatch);
+       if (matches->Count() == 1 || maxScore < score)
+           maxScore = score;
+       if (matches->Count() == 1 || minScore > score)
+           minScore = score;
      }
  
      //
*************** Display::sort(List *matches)
*** 1163,1170 ****
      for (i = 0; i < numberOfMatches; i++)
      {
        array[i] = (ResultMatch *)(*matches)[i];
-       if (i == 0 || maxScore < array[i]->getScore())
-           maxScore = array[i]->getScore();
      }
      matches->Release();
  
--- 1178,1183 ----
Index: htsearch/Display.h
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htsearch/Display.h,v
retrieving revision 1.8.2.4
diff -p -c -r1.8.2.4 Display.h
*** htsearch/Display.h  1999/11/24 05:17:10     1.8.2.4
--- htsearch/Display.h  2000/01/07 09:32:57
*************** protected:
*** 125,131 ****
      // Maximum number of stars to display
      //
      int                       maxStars;
!     int                       maxScore;
  
      //
      // For display, we have different versions of the list of words.
--- 125,132 ----
      // Maximum number of stars to display
      //
      int                       maxStars;
!     double            maxScore;
!     double            minScore;
  
      //
      // For display, we have different versions of the list of words.

diff -cprN ../htdig-3.1.4-with-url_seed_score/htcommon/defaults.cc ./htcommon/defaults.cc
*** ../htdig-3.1.4-with-url_seed_score/htcommon/defaults.cc     Sun Jan 30 13:44:57 2000
--- ./htcommon/defaults.cc      Sun Jan 30 13:51:35 2000
*************** ConfigDefaults  defaults[] =
*** 122,127 ****
--- 122,128 ----
      {"search_algorithm",              "exact:1"},
      {"search_results_footer",         "${common_dir}/footer.html"},
      {"search_results_header",         "${common_dir}/header.html"},
+     {"search_results_order",          ""},
      {"search_results_wrapper",                ""},
      {"server_aliases",                  ""},
      {"server_wait_time",                "0"},
diff -cprN ../htdig-3.1.4-with-url_seed_score/htdoc/attrs.html ./htdoc/attrs.html
*** ../htdig-3.1.4-with-url_seed_score/htdoc/attrs.html Sun Jan 30 13:44:57 2000
--- ./htdoc/attrs.html  Sun Jan 30 12:43:13 2000
***************
*** 5256,5261 ****
--- 5256,5317 ----
        <hr>
        <dl>
          <dt>
+               <strong><a name="search_results_order">
+               search_results_order</a></strong>
+         </dt>
+         <dd>
+               <dl>
+                 <dt>
+                       <em>type:</em>
+                 </dt>
+                 <dd>
+                       string list
+                 </dd>
+                 <dt>
+                       <em>used by:</em>
+                 </dt>
+                 <dd>
+                       <a href="htsearch.html" target="_top">htsearch</a>
+                 </dd>
+                 <dt>
+                       <em>default:</em>
+                 </dt>
+                 <dd>
+                       <em>&lt;empty&gt;</em>
+                 </dd>
+                 <dt>
+                       <em>description:</em>
+                 </dt>
+                 <dd>
+                       This specifies a list of patterns for URLs in
+                       search results.  Results will be displayed in the
+                       specified order, with the search algorithm result
+                       as the second order.  Remaining areas, that do not
+                       match any of the specified patterns, can be placed
+                       by using * as the pattern.  If no * is specified,
+                       one will be implicitly placed at the end of the
+                       list.<br>
+                       See also <a href="#url_seed_score">url_seed_score</a>.
+                 </dd>
+                 <dt>
+                       <em>example:</em>
+                 </dt>
+                 <dd>
+                       <table>
+                         <tr>
+                               <td nowrap>
+                                 search_results_order: /docs/|faq.html *
+                                 /maillist/ /testresults/
+                               </td>
+                         </tr>
+                       </table>
+                 </dd>
+               </dl>
+         </dd>
+       </dl>
+       <hr>
+       <dl>
+         <dt>
                <strong><a name="search_results_wrapper">
                search_results_wrapper</a></strong>
          </dt>
***************
*** 6864,6870 ****
                        point constants.<br>
                        More straightforward is to think of the format as
                        "newscore = oldscore*<em>N</em>+<em>M</em>",
!                       but with the "newscore = oldscore" part left out.
                  </dd>
                  <dt>
                        <em>example:</em>
--- 6920,6928 ----
                        point constants.<br>
                        More straightforward is to think of the format as
                        "newscore = oldscore*<em>N</em>+<em>M</em>",
!                       but with the "newscore = oldscore" part left out.<br>
!                       See also
!                       <a href="#search_results_order">search_results_order</a>.
                  </dd>
                  <dt>
                        <em>example:</em>
diff -cprN ../htdig-3.1.4-with-url_seed_score/htdoc/cf_byname.html ./htdoc/cf_byname.html
*** ../htdig-3.1.4-with-url_seed_score/htdoc/cf_byname.html     Sun Jan 30 13:44:57 2000
--- ./htdoc/cf_byname.html      Sun Jan 30 12:43:13 2000
***************
*** 142,147 ****
--- 142,148 ----
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_algorithm">search_algorithm</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_footer">search_results_footer</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_header">search_results_header</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_order">search_results_order</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_wrapper">search_results_wrapper</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#server_aliases">server_aliases</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#server_max_docs">server_max_docs</a><br>
diff -cprN ../htdig-3.1.4-with-url_seed_score/htdoc/cf_byprog.html ./htdoc/cf_byprog.html
*** ../htdig-3.1.4-with-url_seed_score/htdoc/cf_byprog.html     Sun Jan 30 13:44:57 2000
--- ./htdoc/cf_byprog.html      Sun Jan 30 12:43:13 2000
***************
*** 159,164 ****
--- 159,165 ----
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_algorithm">search_algorithm</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_footer">search_results_footer</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_header">search_results_header</a><br>
+         <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_order">search_results_order</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#search_results_wrapper">search_results_wrapper</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#sort">sort</a><br>
          <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#sort_names">sort_names</a><br>
diff -cprN ../htdig-3.1.4-with-url_seed_score/htlib/List.cc ./htlib/List.cc
*** ../htdig-3.1.4-with-url_seed_score/htlib/List.cc    Fri Apr 16 20:47:40 1999
--- ./htlib/List.cc     Sun Jan 30 12:43:13 2000
*************** List &List::operator=(List &list)
*** 425,427 ****
--- 425,461 ----
  }
  
  
+ //*********************************************************************
+ // void AppendList(List &list)
+ //   Move contents of other list to the end of this list, and empty the
+ //   other list.
+ //
+ void List::AppendList(List &list)
+ {
+     // Never mind an empty list or ourselves.
+     if (list.number == 0 || &list == this)
+       return;
+ 
+     // Correct our pointers in head and tail.
+     if (tail)
+     {
+       // Link in other list.
+       tail->next = list.head;
+       list.head->prev = tail;
+ 
+       // Update members for added contents.
+       number += list.number;
+       tail = list.tail;
+     }
+     else
+     {
+       head = list.head;
+       tail = list.tail;
+       number = list.number;
+     }
+ 
+     // Clear the other list's members so that it is an empty list.
+     list.head = list.tail = list.current = 0;
+     list.current_index = -1;
+     list.number = 0;
+ }
diff -cprN ../htdig-3.1.4-with-url_seed_score/htlib/List.h ./htlib/List.h
*** ../htdig-3.1.4-with-url_seed_score/htlib/List.h     Mon Feb  3 18:11:04 1997
--- ./htlib/List.h      Sun Jan 30 12:43:13 2000
*************** public:
*** 112,117 ****
--- 112,120 ----
      List              &operator= (List *list)         {return *this = *list;}
      List              &operator= (List &list);
  
+     // Move one list to the end of another, emptying the other list.
+     void              AppendList (List &list);
+ 
  protected:
      //
      // Pointers into the list
diff -cprN ../htdig-3.1.4-with-url_seed_score/htsearch/Display.cc ./htsearch/Display.cc
*** ../htdig-3.1.4-with-url_seed_score/htsearch/Display.cc      Sun Jan 30 13:44:57 2000
--- ./htsearch/Display.cc       Sun Jan 30 13:10:27 2000
*************** static char RCSid[] = "$Id: Display.cc,v
*** 25,30 ****
--- 25,31 ----
  #include "HtURLCodec.h"
  #include "HtWordType.h"
  #include "HtURLSeedScore.h"
+ #include "SplitMatches.h"
  
  //*****************************************************************************
  //
*************** Display::buildMatchList()
*** 938,944 ****
      char      *id;
      String    coded_url, url;
      ResultMatch       *thisMatch;
!     List      *matches = new List();
      double      backlink_factor = config.Double("backlink_factor");
      double      date_factor = config.Double("date_factor");
      SortType  typ = sortType();
--- 939,945 ----
      char      *id;
      String    coded_url, url;
      ResultMatch       *thisMatch;
!     SplitMatches matches(config);
      double      backlink_factor = config.Double("backlink_factor");
      double      date_factor = config.Double("date_factor");
      SortType  typ = sortType();
*************** Display::buildMatchList()
*** 1029,1048 ****
        //
        // Append this match to our list of matches.
        //
!       matches->Add(thisMatch);
!       if (matches->Count() == 1 || maxScore < score)
            maxScore = score;
!       if (matches->Count() == 1 || minScore > score)
            minScore = score;
      }
  
      //
!     // The matches need to be ordered by relevance level.
!     // Sort it.
      //
!     sort(matches);
  
!     return matches;
  }
  
  //*****************************************************************************
--- 1030,1054 ----
        //
        // Append this match to our list of matches.
        //
!       matches.Add(thisMatch, url.get());
! 
!       if (maxScore < score)
            maxScore = score;
!       if (minScore > score)
            minScore = score;
      }
  
      //
!     // Each sub-area is then sorted by relevance level.
      //
!     List *matches_part;  // Outside of loop to keep for-scope warnings away.
!     for (matches_part = matches.Get_First();
!        matches_part != 0;
!        matches_part = matches.Get_Next())
!       sort(matches_part);
  
!     // Then all sub-lists are concatenated and put in a new list.
!     return matches.JoinedLists();
  }
  
  //*****************************************************************************
diff -cprN ../htdig-3.1.4-with-url_seed_score/htsearch/Makefile.in ./htsearch/Makefile.in
*** ../htdig-3.1.4-with-url_seed_score/htsearch/Makefile.in     Fri Apr 16 20:47:50 1999
--- ./htsearch/Makefile.in      Sun Jan 30 12:43:13 2000
*************** include $(top_builddir)/Makefile.config
*** 9,15 ****
  
  OBJS=         Display.o DocMatch.o ResultList.o ResultMatch.o \
                Template.o TemplateList.o WeightWord.o htsearch.o \
!               parser.o
  
  FOBJS=                $(top_builddir)/htfuzzy/libfuzzy.a
  TARGET=               htsearch
--- 9,15 ----
  
  OBJS=         Display.o DocMatch.o ResultList.o ResultMatch.o \
                Template.o TemplateList.o WeightWord.o htsearch.o \
!               parser.o SplitMatches.o
  
  FOBJS=                $(top_builddir)/htfuzzy/libfuzzy.a
  TARGET=               htsearch
diff -cprN ../htdig-3.1.4-with-url_seed_score/htsearch/SplitMatches.cc ./htsearch/SplitMatches.cc
*** ../htdig-3.1.4-with-url_seed_score/htsearch/SplitMatches.cc Thu Jan  1 01:00:00 1970
--- ./htsearch/SplitMatches.cc  Sun Jan 30 12:43:13 2000
***************
*** 0 ****
--- 1,175 ----
+ //
+ // SplitMatches.cc
+ //
+ // SplitMatches:
+ //    Holds a list of lists with the matches, as specified in
+ //      search_results_order.
+ //
+ // Part of the ht://Dig package   <http://www.htdig.org/>
+ // Copyright (c) 2000 The ht://Dig Group
+ // For copyright details, see the file COPYING in your distribution
+ // or the GNU Public License version 2 or later
+ // <http://www.gnu.org/copyleft/gpl.html>
+ //
+ // $Id$
+ 
+ #include "StringList.h"
+ #include "StringMatch.h"
+ #include "SplitMatches.h"
+ #include <stdio.h>
+ #include <ctype.h>
+ 
+ // This class is only used in private members of SplitMatches.
+ // The OO-right thing would be to nest this inside the private
+ // declaration of SplitMatches, but that would cause portability
+ // problems according to
+ // <URL:http://www.mozilla.org/hacking/portable-cpp.html#inner_classes>.
+ //
+ // It is used as a container for a key (String) and a list.
+ //
+ class MatchArea : public Object
+ {
+ public:
+     // Construct from a string applicable to StringMatch.
+     MatchArea(const String &);
+ 
+     ~MatchArea();
+ 
+     // Does this item match?
+     inline bool Match(char *s)
+     { return match.hasPattern() && match.FindFirst(s) != -1; }
+ 
+     // Return the contained list.
+     List *MatchList() { return &myList; }
+ 
+ private:
+     StringMatch match;
+     List myList;
+ 
+     // These member functions are not supposed to be implemented, but
+     // mentioned here as private so the compiler will not generate them if
+     // someone puts in buggy code that would use them.
+     MatchArea();
+     MatchArea(const MatchArea &);
+     void operator= (const MatchArea &);
+ };
+ 
+ MatchArea::MatchArea(const String &url_regex)
+ {
+     // We do not want to "install" the catch-the-rest pattern as a real
+     // pattern; it must always return false for the "Match" operator.
+     if (strcmp("*", url_regex.get()) != 0)
+       match.Pattern(url_regex.get());
+ }
+ 
+ MatchArea::~MatchArea()
+ {
+ }
+ 
+ SplitMatches::SplitMatches(Configuration &config)
+ {
+     char *config_item = "search_results_order";
+ 
+     StringList sl(config[config_item], "\t \r\n");
+ 
+     mySubAreas = new List();
+     myDefaultList = 0;
+ 
+     // Parse each as in TemplateList::createFromString.
+     for (int i = 0; i < sl.Count(); i++)
+     {
+       String sub_area_pattern = sl[i];
+       MatchArea *match_item = new MatchArea(sub_area_pattern);
+       mySubAreas->Add(match_item);
+ 
+       // If this is the magic catch-rest sub-area-pattern, we want to
+       // use its list-pointer to store all URLs that do not match
+       // anything else.
+       //  We will iterate over a list where one of the patterns is
+       // known to not match, but that's a small penalty for keeping
+       // the code simple.
+       if (strcmp("*", sub_area_pattern.get()) == 0)
+           myDefaultList = match_item->MatchList();
+     }
+ 
+     // If we did not have a catch-the-rest pattern, install one at the
+     // end of the list.
+     if (myDefaultList == 0)
+     {
+       MatchArea *match_item = new MatchArea(String("*"));
+       mySubAreas->Add(match_item);
+ 
+       myDefaultList = match_item->MatchList();
+     }
+ }
+ 
+ SplitMatches::~SplitMatches()
+ {
+     // myDefaultList is a pointer to one of the items in mySubAreas and
+     // must not be explicitly deleted here.
+ 
+     delete mySubAreas;
+ }
+ 
+ void
+ SplitMatches::Add(ResultMatch *match, char *url)
+ {
+     List *area_list = mySubAreas;
+     MatchArea *area_item;
+ 
+     area_list->Start_Get();
+ 
+     // This is a linear search.  If there's a problem with that, we
+     // can improve it.  For now, a list with tens of areas seems lots,
+     // and break-even with a more clever search-scheme is probably in
+     // the hundreds.
+     while ((area_item = (MatchArea *) area_list->Get_Next()))
+     {
+       // Use the first match only.
+       if (area_item->Match(url))
+       {
+           area_item->MatchList()->Add(match);
+           return;
+       }
+     }
+ 
+     // We'll get here if no match was found, so we add to the
+     // catch-the-rest list.
+     myDefaultList->Add(match);
+ }
+ 
+ // Just a simple iterator function.
+ List *
+ SplitMatches::Get_Next()
+ {
+     MatchArea *next_area = (MatchArea *) mySubAreas->Get_Next();
+     List *next_area_list = 0;
+ 
+     if (next_area != 0)
+       next_area_list = next_area->MatchList();
+ 
+     return next_area_list;
+ }
+ 
+ // Rip out the sub-areas lists and concatenate them into one list.
+ List *
+ SplitMatches::JoinedLists()
+ {
+ 
+     // We make a new list here, so we don't have to worry about
+     // mySubAreas being dangling or null.
+     List *all_areas = new List();
+     List *sub_areas = mySubAreas;
+     MatchArea *area;
+ 
+     sub_areas->Start_Get();
+ 
+     while ((area = (MatchArea *) sub_areas->Get_Next()))
+     {
+       // "Destructively" move the contents of the list,
+       // leaving the original list empty.
+       all_areas->AppendList(*(area->MatchList()));
+     }
+ 
+     return all_areas;
+ }
diff -cprN ../htdig-3.1.4-with-url_seed_score/htsearch/SplitMatches.h ./htsearch/SplitMatches.h
*** ../htdig-3.1.4-with-url_seed_score/htsearch/SplitMatches.h  Thu Jan  1 01:00:00 1970
--- ./htsearch/SplitMatches.h   Sun Jan 30 12:43:13 2000
***************
*** 0 ****
--- 1,53 ----
+ //
+ // SplitMatches.h
+ //
+ // SplitMatches:  Constructed from a Configuration, see doc
+ // for format of config item "search_results_order".
+ //  Used to contain a number of ResultMatches, putting them in separate
+ // lists depending on the URL with method Add.
+ //  Iterator methods Get_First and Get_Next return the sub-lists.
+ // Method JoinedLists returns a new list with all the sub-lists
+ // concatenated.
+ //
+ // $Id$
+ //
+ // Part of the ht://Dig package   <http://www.htdig.org/>
+ // Copyright (c) 2000 The ht://Dig Group
+ // For copyright details, see the file COPYING in your distribution
+ // or the GNU Public License version 2 or later
+ // <http://www.gnu.org/copyleft/gpl.html>
+ //
+ #ifndef _splitmatches_h
+ #define _splitmatches_h
+ 
+ #include "Configuration.h"
+ #include "ResultMatch.h"
+ #include "List.h"
+ 
+ class SplitMatches
+ {
+ public:
+     SplitMatches(Configuration &);
+     ~SplitMatches();
+ 
+     void Add(ResultMatch *, char *);
+     List *JoinedLists();
+     List *Get_First()
+     { mySubAreas->Start_Get(); return Get_Next(); }
+ 
+     List *Get_Next();
+ 
+ private:
+     // These member functions are not supposed to be implemented.
+     SplitMatches();
+     SplitMatches(const SplitMatches &);
+     void operator= (const SplitMatches &);
+ 
+     // (Lists of) Matches for each sub-area regex.
+     List *mySubAreas;
+ 
+     // Matches for everything else.
+     List *myDefaultList;
+ };
+ 
+ #endif /* _splitmatches_h */
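
For anyone who wants the idea without reading the whole patch: matches are
put into one bucket per pattern (first matching pattern wins, "*" collects
everything else), each bucket is sorted on its own, and the buckets are then
spliced together in the configured order.  Below is a rough standalone
sketch of that flow.  It is not ht://Dig code: Hit and Buckets are
hypothetical names, std::list stands in for List, a plain substring test
stands in for StringMatch, and sorting by descending score is an assumption
(the patch calls Display::sort, which follows the configured sort type).

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-in for ResultMatch: only a URL and a score.
struct Hit {
    std::string url;
    double score;
};

// Hypothetical stand-in for SplitMatches and MatchArea together.
class Buckets {
public:
    explicit Buckets(const std::vector<std::string> &patterns)
        : patterns_(patterns)
    {
        // Install a trailing catch-the-rest area if "*" was not listed.
        if (std::find(patterns_.begin(), patterns_.end(), "*") == patterns_.end())
            patterns_.push_back("*");
        default_ = static_cast<std::size_t>(
            std::find(patterns_.begin(), patterns_.end(), "*") - patterns_.begin());
        areas_.resize(patterns_.size());
    }

    // The first pattern found in the URL wins; "*" itself never matches.
    void add(const Hit &hit)
    {
        for (std::size_t i = 0; i < patterns_.size(); ++i) {
            if (i != default_ && hit.url.find(patterns_[i]) != std::string::npos) {
                areas_[i].push_back(hit);
                return;
            }
        }
        areas_[default_].push_back(hit);
    }

    // Sort each area on its own, then move them all into one list,
    // leaving the areas empty (the same idea as List::AppendList).
    std::list<Hit> joined()
    {
        std::list<Hit> all;
        for (std::list<Hit> &area : areas_) {
            area.sort([](const Hit &a, const Hit &b) { return a.score > b.score; });
            all.splice(all.end(), area);
        }
        return all;
    }

private:
    std::vector<std::string> patterns_;
    std::vector<std::list<Hit> > areas_;
    std::size_t default_;
};
```

With the patterns "/local/", "*" and "/archive/" in that order, every hit
whose URL contains "/local/" comes out ahead of the unmatched hits, and the
"/archive/" hits come last, whatever their scores are.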
