If you can, try installing 3.1.6 and running some tests. Before installing, you might also send the output of some sample searches to text files, then compare them against the same searches in 3.1.6 to track down where the discrepancies are.
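For the comparison step, something along these lines might help. It is only a rough sketch using Python's standard difflib module, and it assumes you have already saved the output of the same search from each version as text. The function name and the version labels are made up for illustration; how you capture the htsearch output is left to you.

```python
import difflib
from typing import List


def diff_search_outputs(old_text: str, new_text: str,
                        old_label: str = "3.1.6",
                        new_label: str = "3.2.0b4") -> List[str]:
    """Return unified-diff lines between two saved search outputs.

    old_label and new_label are only used in the diff header; the
    defaults are illustrative, not something htsearch produces.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # lineterm="" keeps the header lines free of trailing newlines,
    # since splitlines() already stripped them from the content.
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_label,
                                     tofile=new_label,
                                     lineterm=""))
```

Read the two saved files into strings, pass them in, and print the returned lines. An empty list means the two result pages are identical, so any ordering difference shows up as removed and re-added lines.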
According to the notes from this thread in October, you might try running your HTML through an HTML validation service (to make sure all tags are properly closed and so on). If a tag is left open, htdig might not be reading the content that follows it (this is just a guess; you'll have to check with the developers to see whether this is, in fact, a possibility).
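If a validation service is not convenient, a quick local check for unclosed tags could look roughly like this. It is a sketch built on Python's standard html.parser, not a real validator: the class and function names are invented here, and it will also report tags such as p or li whose closing tag HTML allows you to omit, so treat its output as a list of places to look.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Track start tags and note any that are never explicitly closed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open
        self.unclosed = []   # tags skipped over by a later closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag; ignored in this sketch
        # Pop back to the matching open tag; anything in between
        # was opened but never closed.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def unclosed_tags(html_text):
    """Return the names of tags that were opened but not closed."""
    tracker = OpenTagTracker()
    tracker.feed(html_text)
    tracker.close()
    return tracker.unclosed + tracker.stack
```

Running this over one of the HTML pages that ranks badly, and over the HTML that doc2html produces for a PDF, would at least show whether the two differ in this respect.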
Good luck.
Ted Stresen-Reuter
On Monday, February 17, 2003, at 05:45 AM, David Adams wrote:
You have
keywords_factor: 500
Is it possible that the authors of the PDF documents have been diligent in
setting the keywords whilst the authors of the HTML pages have not bothered?
Just a thought.
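One way to test this would be to look at what keywords each kind of document actually carries. The following is a rough sketch using Python's standard html.parser; the names are made up for illustration, and it assumes you can get at the HTML that the converter produces for a PDF as well as one of your ordinary HTML pages. It only looks at meta tags named "keywords".

```python
from html.parser import HTMLParser


class MetaKeywordsFinder(HTMLParser):
    """Collect the comma-separated contents of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), hence "or".
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip())


def meta_keywords(html_text):
    """Return the list of keywords declared in the page's meta tags."""
    finder = MetaKeywordsFinder()
    finder.feed(html_text)
    finder.close()
    return finder.keywords
```

If the converted PDFs come back with long keyword lists and the HTML pages come back empty, a keywords_factor of 500 would go some way towards explaining the ordering.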
David Adams
Corporate Information Services
Information Systems Services
University of Southampton
----- Original Message -----
From: "Michael Boer" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, February 14, 2003 11:18 PM
Subject: [htdig] pdf and doc hits sorted first in htsearch results?
We are running ht://Dig 3.2.0b4-011302 on a Red Hat 7.3 system, installed
from the standard Red Hat RPMs. We have been using doc2html to parse PDFs and
DOCs, with the following lines at the end of /etc/htdig.conf:
external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
        application/postscript->text/html /usr/local/bin/doc2html.pl \
        application/pdf->text/html /usr/local/bin/doc2html.pl
The mystery is: How can we get htsearch to stop bunching all the .pdf and
.doc files at the top of the results? For reasons unclear to me, all matching
.pdf files are listed, then all the .doc files, and then all the .html files.
Our search algorithm and weighting factors are like this:
search_algorithm: exact:1 synonyms:0.2 endings:0.1
#backlink_factor: 1000.0
#date_factor: 0.00
#description_factor: 150
#heading_factor: 5.0
keywords_factor: 500
meta_description_factor: 100
#text_factor: 1
#title_factor: 100
heading_factor_1: 10
heading_factor_2: 5
heading_factor_3: 4
#heading_factor_4: 1
#heading_factor_5: 1
#heading_factor_6: 0
Any suggestions? (We're just about ready to give up indexing .pdf and
.doc files altogether.)
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

