We are running ht://Dig 3.2.0b4-011302 on Red Hat 7.3, installed from the standard Red Hat RPMs. We use doc2html to convert PDF, PostScript, and Word files to HTML for indexing, with the following lines at the end of /etc/htdig.conf:

external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
application/postscript->text/html /usr/local/bin/doc2html.pl \
application/pdf->text/html /usr/local/bin/doc2html.pl

The mystery is: how can we stop htsearch from bunching all the .pdf and .doc files at the top of the results? For reasons unclear to me, all matching .pdf files are listed first, then all the .doc files, and only then the .html files.

Our search algorithm and weighting factors are set as follows:

search_algorithm: exact:1 synonyms:0.2 endings:0.1

#backlink_factor: 1000.0
#date_factor: 0.00
#description_factor: 150
#heading_factor: 5.0
keywords_factor: 500
meta_description_factor: 100
#text_factor: 1
#title_factor: 100
heading_factor_1: 10
heading_factor_2: 5
heading_factor_3: 4
#heading_factor_4: 1
#heading_factor_5: 1
#heading_factor_6: 0
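
To show why I suspect the weights, here is a toy sketch of how per-field factors like those above could combine into a single score. It is not ht://Dig's actual scoring code: the function name toy_score, the simple weighted sum, the text factor of 1, and both sample documents are assumptions made purely for illustration.

```python
# Toy model only: this is NOT ht://Dig's real scoring code.
# The factor values mirror the uncommented lines of the config above.
FACTORS = {
    "keywords": 500,
    "meta_description": 100,
    "heading_1": 10,
    "heading_2": 5,
    "heading_3": 4,
    "text": 1,  # assumed value; this line is commented out in our config
}


def toy_score(hits):
    """Return the sum of (match count in a field) * (that field's factor)."""
    return sum(FACTORS[field] * count for field, count in hits.items())


# Hypothetical documents: one match in a keywords field versus
# twenty matches in body text plus one in a top-level heading.
converted_doc = {"keywords": 1}
plain_html = {"text": 20, "heading_1": 1}

print(toy_score(converted_doc))  # 500
print(toy_score(plain_html))     # 30
```

Under this simplified model, a single match in a heavily weighted field outranks many body-text matches, so if converted documents happened to carry matches in such a field they would float to the top. Whether that is what is happening here is exactly what I cannot tell.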


Any suggestions? (We're just about ready to give up indexing .pdf and .doc files altogether.)



