On Thu, 26 Apr 2001, Tony Crockford wrote:
> > > > > There's also a "keywords" field for the htsearch form
> > > > > for exactly this purpose.
> >
> > My problem was that htdig seems to be stripping the "-"
> > (dash) character
> > (Actually, it doesn't seem quite right to me that htsearch can't
> > find "foo-bar-baz", since, acc. to the doc for valid_punctuation,
> > "the same transformation is performed on the keywords the search
> > engine gets".  Maybe it doesn't apply to "keywords", but just to
> > "words"?)
>
> I had something similar, and the problem is that htdig does different
> stuff with the search form added keywords, to what it does with words
> typed into the form.  In fact I'm surprised you didn't get failures
> adding "-" to the htsearch form keywords.
>
> Having reread your mail, I think I'll step down and let someone more
> qualified explain the difference between:
>
> htdig-keywords added to meta tags in your documents and keywords added
> to the htsearch form.  Although I understand it, I don't feel up to
> explaining it!
>
> By the look of it you've got them muddled up too.

Let me describe what I've done in a bit more detail, for those who
might be trying to make sense of it.

First off, I'm using htdig-3.1.5.

I grabbed 15 documents from our website, and put them in a directory.
I gave each one a meta tag.  Here are the 15 tags I created (one in
each document):

<meta name="htdig-keywords" content="ir-cg">
<meta name="htdig-keywords" content="ir-cg-a">
<meta name="htdig-keywords" content="ir-cg-b">
<meta name="htdig-keywords" content="ir-cg-b-x">
<meta name="htdig-keywords" content="ir-cg-b-y">
<meta name="htdig-keywords" content="ir-cg-c">
<meta name="htdig-keywords" content="ir-cg-c-x">
<meta name="htdig-keywords" content="ir-cg-c-y">
<meta name="htdig-keywords" content="ir-cg-c-z">
<meta name="htdig-keywords" content="ir-cg-c-z-foo">
<meta name="htdig-keywords" content="ir-cg-c-z-bar">
<meta name="htdig-keywords" content="ir-cg-d-foo">
<meta name="htdig-keywords" content="ir-cg-d-bar">
<meta name="htdig-keywords" content="ir-cg-foo">
<meta name="htdig-keywords" content="ir-cg-foo-bar">

Next, I created a separate htdig.conf file so I could build an index on
just these 15 documents, then I ran htdig/htmerge.  I didn't specify
minimum_word_length (default is 3), so it didn't include the individual
subwords of length less than 3, such as "ir", "a", "b", "x", etc.  But
it did include, as Gilles described yesterday, all the various combinations
of subwords for each meta tag keyword; e.g. for "ir-cg-d-foo", it included
ir-cg-d-foo, ir-cg-d, ir-cg, cg-d-foo, cg-d, d-foo, and foo.  BUT, it
included them WITHOUT the "-", so if you look in the .wordlist.work file
afterwards, you find ircgdfoo, ircgd, ircg, cgdfoo, cgd, dfoo, foo.
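
The term list above is consistent with a simple rule: split the keyword
on "-", join every contiguous run of parts with the dash removed, and
drop anything shorter than minimum_word_length.  The Python sketch below
only reproduces that observed list.  It is not htdig's code, and the
name index_terms and its parameters are made up for illustration.

```python
def index_terms(keyword, punctuation="-", min_len=3):
    """Model the dash-free terms seen in the word list for one keyword.

    Assumption: only "-" is treated as punctuation here, and min_len
    stands in for htdig's minimum_word_length (default 3).
    """
    parts = [p for p in keyword.split(punctuation) if p]
    terms = []
    # Every contiguous run of parts, joined without the punctuation.
    for start in range(len(parts)):
        for end in range(start + 1, len(parts) + 1):
            term = "".join(parts[start:end])
            if len(term) >= min_len and term not in terms:
                terms.append(term)
    return terms


print(index_terms("ir-cg-d-foo"))
# ['ircg', 'ircgd', 'ircgdfoo', 'cgd', 'cgdfoo', 'dfoo', 'foo']
```

Short runs such as "ir", "cg" and "d" fall below the length limit, which
matches their absence from the word list.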

Now I created a search form where I collect "words" with a
<input type="text"> and "keywords" by way of a <select name="keywords">
with a bunch of <option> tags like these:
<option value="ircg">ircg
<option value="ircga">ircga
<option value="ircgb">ircgb
<option value="ircgbx">ircgbx
<option value="ircgby">ircgby
(etc.)

If I enter in the text field a word which occurs in all 15 documents,
then in the result set, I get the number of documents at or below the
specified "keywords" value.  For example, if I select "ircg", then I
get all 15 documents in the result.  But if I select "ircgcz", then it
returns 3 documents (that is, the ones with meta tags of ir-cg-c-z,
ir-cg-c-z-foo, and ir-cg-c-z-bar).  This is great!  This is exactly
what it should be doing.
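
Here is a toy model of that filtering, assuming the index holds only
the dash-free terms and that a "keywords" value has to equal one of
them exactly.  The helper names are hypothetical, and the selector for
the ir-cg-c-z branch is written "ircgcz", the dash-free form of that
tag.

```python
# The 15 htdig-keywords values used in the test documents.
TAGS = [
    "ir-cg", "ir-cg-a", "ir-cg-b", "ir-cg-b-x", "ir-cg-b-y",
    "ir-cg-c", "ir-cg-c-x", "ir-cg-c-y", "ir-cg-c-z",
    "ir-cg-c-z-foo", "ir-cg-c-z-bar", "ir-cg-d-foo", "ir-cg-d-bar",
    "ir-cg-foo", "ir-cg-foo-bar",
]


def dash_free_terms(tag, min_len=3):
    """Contiguous runs of dash-separated parts, joined without dashes."""
    parts = tag.split("-")
    runs = set()
    for start in range(len(parts)):
        for end in range(start + 1, len(parts) + 1):
            term = "".join(parts[start:end])
            if len(term) >= min_len:
                runs.add(term)
    return runs


def documents_matching(keyword_value):
    """Tags whose modelled index terms contain the selected value."""
    return [tag for tag in TAGS if keyword_value in dash_free_terms(tag)]


print(len(documents_matching("ircg")))   # 15
print(documents_matching("ircgcz"))
# ['ir-cg-c-z', 'ir-cg-c-z-foo', 'ir-cg-c-z-bar']
```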

Now the problem:  If in my <option> tags, I specify my keywords the
way I did in the <meta> tags, like this:
<option value="ir-cg">ir-cg
<option value="ir-cg-a">ir-cg-a
<option value="ir-cg-b">ir-cg-b
<option value="ir-cg-b-x">ir-cg-b-x
<option value="ir-cg-b-y">ir-cg-b-y
(etc.)

and then specify one of these in the search form's popup, and enter
the same word in the text field as before, then htsearch finds NO
matches.

Could this be a simple bug?  As I said in my last email, from my reading
of the docs (e.g. under "valid_punctuation"), my impression is that
htsearch should strip out the "-" (which is included in the default
"valid_punctuation" set) before sending the keyword to the search engine.
I'm guessing it DOES remove the dash for the "words" field, but not for
the "keywords" field.
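
A possible workaround in the meantime, given that the dash-free values
do match: keep the dashes in the label the user sees, but submit the
dash-free form as the option value.  The sketch below generates such
<option> tags.  The function name is hypothetical, and it assumes "-"
is the only punctuation character that needs stripping.

```python
from html import escape


def option_tag(keyword, punctuation="-"):
    """Build an <option> whose value is dash-free but whose label is not.

    Assumption: removing "-" mirrors what was done to the keyword when
    it was indexed.
    """
    value = keyword.replace(punctuation, "")
    return '<option value="{}">{}'.format(
        escape(value, quote=True), escape(keyword)
    )


for kw in ["ir-cg", "ir-cg-a", "ir-cg-b", "ir-cg-b-x", "ir-cg-b-y"]:
    print(option_tag(kw))
# First line printed: <option value="ircg">ir-cg
```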

Here's another strange occurrence (bug?):
Even if my <option> tag values do not contain the dash character, if
I enter in the text field a "word" shorter than the minimum_word_length,
I get an "internal server error" from Apache, and I see the following
error in my error log:
"Premature end of script headers: /blah/blah/cgi-bin/htsearch"

This does not happen when the search does not include the "keywords"
field (e.g. if I include and select an <option value=""> tag).

---
Patrick Robinson
AHNR Info Technology, Virginia Tech
[EMAIL PROTECTED]

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
