Rob Kremer's bits of Thu, 18 Jul 2002 translated to:

>I am trying to determine if I have an incorrect configuration or if there is a
>bug.  When I search for an exact phrase, it will return some matches that don't
>have the exact match, such as searching for "htdig exclude" on the
>http://www.htdig.org site.  It will return 6 matches, only 2 have exact matches
>in them, the others are two ChangeLog files and two FAQ files.  Any ideas?

I believe that this is expected behavior. The word positions are
computed after parsing terms from the original text. That means
you need to account for what happens with numbers, words shorter
than the specified minimum length, characters that break strings
of text into word parts, etc.

In this specific case, I think the extra hits you are seeing are
due to the following bits of text.

> htdig/htdig.cc: If exclude_urls

In this case, htdig.cc is split into htdig and cc. The cc is
dropped because it has fewer than three characters. The : is
tossed out for obvious reasons. The 'If' is dropped because it,
too, has fewer than three characters. The exclude_urls is split
into exclude and urls. The net result is that you end up with an
htdig next to an exclude.

> htdig/Retriever.cc,htdig/htdig.cc: "exclude_urls"

Basically the same as the above.

> htdig&restrict=&exclude

This occurs in both FAQ pages. It ends up being parsed as
htdig&restrict exclude, where htdig&restrict is treated as a
two-part term (i.e. both htdig and restrict are given the same
word position). So the result is that 'htdig' has a word position
adjacent to 'exclude'.
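
To make the reasoning above concrete, here is a rough Python sketch of
a toy word index. It is not ht://Dig's actual parser. The names
(index_positions, phrase_matches), the set of punctuation that splits a
term into parts (. / _ &), and the rule that dropped words consume no
word position are all assumptions made for illustration. Only the
minimum length of three characters comes from the explanation above.

```python
import re

MIN_WORD_LENGTH = 3  # words shorter than this are dropped

# Assumption: these characters split one term into parts that share a position.
PART_BREAK = re.compile(r"[./_&]")
# Assumption: anything else that is not a letter or digit separates terms.
TERM_BREAK = re.compile(r"[^A-Za-z0-9./_&]+")


def index_positions(text):
    """Return (position, word) pairs for a toy word index."""
    entries = []
    position = 0
    for term in TERM_BREAK.split(text):
        parts = [p.lower() for p in PART_BREAK.split(term)
                 if len(p) >= MIN_WORD_LENGTH]
        if not parts:
            continue  # nothing indexable, so no position is consumed
        for part in parts:
            entries.append((position, part))  # all parts share one position
        position += 1
    return entries


def phrase_matches(text, first, second):
    """True if `second` sits at the position right after `first`."""
    entries = index_positions(text)
    first_positions = {pos for pos, word in entries if word == first}
    second_positions = {pos for pos, word in entries if word == second}
    return any(pos + 1 in second_positions for pos in first_positions)


if __name__ == "__main__":
    for snippet in ("htdig/htdig.cc: If exclude_urls",
                    'htdig/Retriever.cc,htdig/htdig.cc: "exclude_urls"',
                    "htdig&restrict=&exclude"):
        print(snippet, "->", phrase_matches(snippet, "htdig", "exclude"))
```

Under these assumptions, all three snippets quoted above count as a
match for the phrase "htdig exclude", even though none of them contains
that literal string.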

Jim





_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
