Hi!

Long time no see, htdig people.  After catching up with the
mailing lists (phew!) it seems there's lots of good progress,
not the least being phrase searching, the test-suite and
compressing the DB.  Good to hear!
 Sorry for being absent for those questions that always come up
for your "old sins".  Here's a new one :-) :

I'd like to add a feature that came up when discussing the
ht://Dig use on gcc.gnu.org; to adjust the score of a hit
depending on the URL.  This way, hits in different "areas"
within a site can be weighed differently, to be reflected in the
result list.  I see this feature has also been requested by
others more than once, so it can't be a total waste to others.
(BTW, would this relate to stuff in the "PageRank" paper?)

I propose to implement it as follows:
An attribute "url_adjust_score" is added, containing a list of
regexes and values.
 Its contents are to be used in Searcher::score() after all other
scores have been computed, by looking up the URL in the list
of regexes and adjusting the score according to the matching value.

I believe score adjustments such as this are generally best
done at the time of the *search*, not at the time of indexing
(digging).  The rationale is that a site admin may very well
feel like changing the score-adjustments, for example when
adding a new web area.  There seems to be only a negligible
performance win by doing the mapping when indexing, and it would
be quite awkward to have to re-index from scratch when changing
the adjustment rules.

Details:
The attribute "url_adjust_score" would be a list of "regex"
and "value" pairs, where the regex is to match the web area, and
the value is a simple text-formula for giving an additive
constant and a factor, thus giving a linear adjustment of the
score: (M * original_score) + N.
 The formula format would look like "*M[+]N", for example
"*.75-500".
That is, one added floating-point number preceded by (optionally)
plus or minus, and one multiplied factor (also a floating-point
number), preceded by an asterisk.  Spaces and comma to separate
the two parts could be allowed as sugar, if the whole format is
quoted.
 Either part may be left out; a missing factor would mean "*1"
and a missing constant "+0".
One could think of it as an arithmetic expression "score =
docscore*M+N", but with the left part of the formula left out.
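
To make the format concrete, here is a rough sketch of how such a
formula could be parsed and applied.  It is written in Python just
for illustration and is not the planned htdig code; the names
parse_formula and adjust are made up, and rejecting an empty formula
is my own assumption, not part of the proposal.

```python
import re

# A floating-point number such as "500", "1.1" or ".75" (no exponent form).
_NUM = r"(?:\d+\.?\d*|\.\d+)"

# Optional "*M", optional separator sugar (spaces, commas), optional "[+-]N".
_FORMULA = re.compile(
    rf"\s*(?:\*\s*(?P<factor>{_NUM}))?[\s,]*(?P<const>[+-]?{_NUM})?\s*"
)


def parse_formula(text):
    """Return (M, N) for a formula like "*.75-500".

    A missing factor defaults to 1 and a missing constant to 0.
    An empty or malformed formula raises ValueError (an assumption).
    """
    m = _FORMULA.fullmatch(text)
    if m is None or (m.group("factor") is None and m.group("const") is None):
        raise ValueError(f"bad score formula: {text!r}")
    factor = float(m.group("factor")) if m.group("factor") else 1.0
    const = float(m.group("const")) if m.group("const") else 0.0
    return factor, const


def adjust(score, formula):
    """Apply the linear adjustment (M * score) + N."""
    factor, const = parse_formula(formula)
    return score * factor + const
```

With this sketch, adjust(1000, "*.75-500") computes 1000 * .75 - 500,
and "*1.1" or "-100" alone are accepted as well.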

I believe using this type of formula is a lot better than
e.g. having the attribute value as a triple with two numeric
values: "webarea factor constant"; people would always mix up
the order of the two numbers.
 It is also more flexible than (e.g.) just having a factor to
multiply with: "webarea factor".  I guess the additive constant
might not be used a lot; finding out the right constant to add
seems awkward, as long as there is no specific guide to find
out the (reasonably) highest and lowest score a document can
get as a "hit".
(Sorry, I do not volunteer to write such a guideline ;-)

For the GCC project, there are the areas (minus those I forgot):
- Normal web-area; just plain "hand-written" documents.
- Area with mailing lists (under "/ml/").
- Area with online documentation ("/onlinedocs/").
- Faq-O-Matic area ("/fom_serv/").
- Testsuite-results area ("/testresults/")

So, if test-suite and mailing list results should be at the
bottom, and the FAQ and online documentation on top, you might
want (as a dreamed up example without afterthought):

url_adjust_score:       /ml/gcc-cvs *.5         \
                        /ml/ *.7                \
                        /testresults/ *.3-100   \
                        /fom_serv/ *1.1         \
                        /onlinedocs/ *1.1

I'm not sure how the list of pairs should be applied; either the
formulas for all matching items should be cascaded, or just the
first matching item picked.  I guess that using only the first
specified url pattern that matches makes most sense and will
give the least surprises.
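
To show the difference between the two ways of applying the list,
here is a small sketch using the example rules above, already split
into (pattern, M, N).  Again this is illustrative Python, not htdig
code; the function names are hypothetical, and the formulas are
written out as numbers to keep the sketch short.

```python
import re

# The example rules, in configuration order: (regex, factor M, constant N).
RULES = [
    ("/ml/gcc-cvs", 0.5, 0.0),
    ("/ml/", 0.7, 0.0),
    ("/testresults/", 0.3, -100.0),
    ("/fom_serv/", 1.1, 0.0),
    ("/onlinedocs/", 1.1, 0.0),
]


def adjust_first_match(url, score, rules=RULES):
    """Apply only the first rule whose regex matches the URL."""
    for pattern, factor, const in rules:
        if re.search(pattern, url):
            return score * factor + const
    return score  # no rule matched: score is left unchanged


def adjust_cascade(url, score, rules=RULES):
    """Apply every matching rule in turn, each on the previous result."""
    for pattern, factor, const in rules:
        if re.search(pattern, url):
            score = score * factor + const
    return score
```

For a URL under /ml/gcc-cvs, both the "/ml/gcc-cvs" and the "/ml/"
rules match.  First-match gives score * .5, while cascading gives
score * .5 * .7, which is probably not what the admin meant when
writing the more specific rule first.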

In the example above, hits in those mailing lists whose names
start with "gcc-cvs" (gcc-cvs, gcc-cvs-wwwdocs) will get their
score halved, and a hit in other parts of the mailing lists will
get its score cut to 70%.  Test results are scaled to 30% and
then have an additional 100 subtracted, because... I had to find
a use for the constant :-)

And yes, finding good values for the "formula" might be hard and
need experimentation for each use, but a linear mapping should
be all that is needed.

I can think of an alternative, more detailed way to specify
adjustments:  One could adjust each ..._factor attribute based
on the URL, using the new XML-like configuration format.  But
that looks a little too complicated for the average admin IMHO, and I do
not plan to implement it (right now):
 <url: /ml/
  header_factor: *.5
  text_factor: *.5
  ...
 </url>
Perhaps a generic "factor_adjust" only valid in the "url" context would
make sense here.

I plan to provide patches and check them in for the htdig-3-1-x
branch (in particular, 3.1.4) in addition to installing it on the
main trunk.

Comments, better-name suggestions, flames, screams in agony?

brgds, H-P

