When experimenting with how url_seed_score works, I find that
I'm not completely happy with it.  It's a bit hard to use in
practice, since score-figures are so divergent - mostly for a
documented reason, mind you.
(Note that url_seed_score - renamed from url_adjust_score - is
not submitted or checked-in yet.)

Most people would probably just want the hits to appear in an
order depending on the web-area.  They probably do not want to
bother with finding the right magic constants with which to seed
the scores.
 To down-seed an area completely, you have to find the
highest possible score and/or adjust the *_factor parameters.
A score easily goes up into the millions with a couple of
back-links and the searched-for word appearing in the title and
the first words.  Besides, seeding the score munges the
actual score, making the easily understood "stars" and
"percentage" inaccurate.

Now, I think url_seed_score is still useful, for instance when
you actually want to mix results from different areas together
and only slightly seed some areas.
 As a side-note to prospective users, small-figure factors and
constants should be used, if score figures are important.  It
seems they need to be kept at most in the thousands, or
scores go completely off.

For the just-order-in-these-areas use, I would like to propose
another more easily-used feature, controlled by an attribute
called (say) "results_order".
 It would simply take a list of regexes and always order the
results according to that list, with the "normal" sort order
as the second-order sort criterion.  Users of this attribute
might want to include it in advanced search forms (see
allow_in_form) as an option that can be turned off, for
searchers who don't want someone else to dictate the order in
which search hits are served. :-)
Use of this attribute would look like:

 results_order: faq.html * /mailinglist/ /testresults/

Since you probably want to "move up" some areas in the results
list and "move down" others, you want to say where you want the
rest.
 This is expressed most intuitively (IMHO) as a lone "*".  That
character is seldom found as a normal part of a URL (is it even
valid there?), and it is not the catch-all regex ".*".  If not
specified in the list, it defaults to the end of the list.
 And no, I can't think of a sane way to use ".*" in that list,
but it is a valid regex and as such should not be special-cased
IMHO.
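
A minimal sketch of how such a list could be parsed, written in
Python purely for illustration (it is not ht://Dig code).  The names
parse_results_order and REST are hypothetical, and ignoring a
repeated "*" is my assumption; the only behaviour taken from the
proposal is that a lone "*" marks the place for all other areas and
defaults to the end of the list.

```python
import re

REST = "*"  # hypothetical marker for "all other areas"


def parse_results_order(value):
    """Split a results_order value into a list of area entries.

    Each entry is either a compiled regex or the REST marker.
    """
    entries = []
    for token in value.split():
        if token == REST:
            # Assumption: a repeated lone "*" is simply ignored.
            if REST not in entries:
                entries.append(REST)
        else:
            # Everything else, including ".*", is an ordinary regex.
            entries.append(re.compile(token))
    if REST not in entries:
        # Not specified: "*" defaults to the end of the list.
        entries.append(REST)
    return entries
```

Note that ".*" is not special-cased here: it is compiled like any
other regex, and only the exact token "*" is treated as the marker.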

For the example above, you always want hits in faq.html to
appear first.  It is probably a large document, so a low hit
score may just mean that the search term is found near the end
of it; it is still probably the document the searcher is
looking for.
 The area matched by /mailinglist/ is moved down, but still
before the lowly /testresults/ area.  As said, all other areas
come at the point of the "*".

I'll implement this for 3.1.4 (as a patch) and for main trunk
after moving the sorting to Searcher.cc.  I'll refrain from
doing this until the 3.2 changes are merged back on the main
trunk.  Before someone else says it: No, I *don't* want this to
go in 3.2.0  (unless everybody else thinks so).

BTW, Geoff: you said you were about to merge back the 3.2 changes.
Would you rather do it yourself, or would you want help with that?

The implementation of results_order seems simple:  Wherever the
searching takes place, the results will be divided (or are already
divided) into lists separate for each area in results_order.  Then
the normal "sort" is applied for each list, then the lists are
concatenated to one, which is passed on for display-decorating and
output.
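
The steps above can be sketched as follows, again in Python for
illustration only and not as the actual Searcher.cc code.  The
function name order_results, the (url, score) tuples and the
example.org URLs are hypothetical.  Two things are assumptions on my
part: a URL matching several areas goes to the first one listed, and
"highest score first" stands in for the normal sort.

```python
import re


def order_results(hits, patterns):
    """Order (url, score) hits by area, then by score within an area.

    `patterns` is the results_order list as strings, "*" included.
    """
    patterns = list(patterns)
    if "*" not in patterns:
        patterns.append("*")  # "*" defaults to the end of the list
    rest = patterns.index("*")
    compiled = [None if p == "*" else re.compile(p) for p in patterns]

    # One separate list per area, in results_order order.
    buckets = [[] for _ in patterns]
    for url, score in hits:
        for i, regex in enumerate(compiled):
            # Assumption: the first matching area wins.
            if regex is not None and regex.search(url):
                buckets[i].append((url, score))
                break
        else:
            # No area matched: the hit goes where the "*" is.
            buckets[rest].append((url, score))

    ordered = []
    for bucket in buckets:
        # Stand-in for the "normal" sort: highest score first.
        bucket.sort(key=lambda hit: hit[1], reverse=True)
        ordered.extend(bucket)  # concatenate the per-area lists
    return ordered


hits = [
    ("http://example.org/testresults/run1.html", 900),
    ("http://example.org/docs/intro.html", 500),
    ("http://example.org/faq.html", 10),
    ("http://example.org/mailinglist/0042.html", 700),
    ("http://example.org/docs/install.html", 800),
]
order = ["faq.html", "*", "/mailinglist/", "/testresults/"]
for url, score in order_results(hits, order):
    print(score, url)
```

With this made-up data the faq.html hit comes first despite its low
score, then the two docs hits by score, then the mailinglist hit, and
the testresults hit last even though it has the highest score.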

Comments welcome, as always.

brgds, H-P

