You have

keywords_factor: 500

Is it possible that the authors of the PDF documents have been diligent in
setting the keywords whilst the authors of the HTML pages have not bothered?

Just a thought.
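
One way to test this would be to take the HTML that the converter produces
for a few of the PDFs, plus a few of your ordinary HTML pages, and see which
of them carry a keywords META tag. Below is a minimal sketch in Python. It
assumes you have already saved the converter output to files. Whether
doc2html actually passes the PDF keywords through as a META tag is an
assumption here, not something I have checked. The names extract_keywords
and MetaKeywordsParser are made up for this example.

```python
import sys
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the content of every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "keywords":
            self.keywords.append(attr.get("content", ""))


def extract_keywords(html_text):
    """Return a list with the content of each keywords META tag found."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords


if __name__ == "__main__":
    # Hypothetical usage: pass saved HTML files on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            found = extract_keywords(handle.read())
        print(path, "->", found if found else "no keywords META tag")
```

If the converted PDFs consistently show keywords and the HTML pages do not,
that would fit with keywords_factor: 500 pushing the PDFs to the top.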

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Michael Boer" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, February 14, 2003 11:18 PM
Subject: [htdig] pdf and doc hits sorted first in htsearch results?


> We are running ht://Dig 3.2.0b4-011302 on a Red Hat 7.3 system, installed from
> the standard Red Hat RPMs.  We have been using doc2html to parse PDFs and DOCs,
> with the following lines at the end of /etc/htdig.conf:
>
> external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
>                    application/postscript->text/html /usr/local/bin/doc2html.pl \
>                    application/pdf->text/html /usr/local/bin/doc2html.pl
>
> The mystery is:  How can we get htsearch to stop bunching all the .pdf and .doc
> files at the top of the results?  For reasons unclear to me, all matching .pdf
> files are listed, then all the .doc files, and then all the .html files.
>
> Our search algorithm and weighting factors are like this:
>
> search_algorithm:       exact:1 synonyms:0.2 endings:0.1
>
> #backlink_factor: 1000.0
> #date_factor: 0.00
> #description_factor:  150
> #heading_factor: 5.0
> keywords_factor: 500
> meta_description_factor: 100
> #text_factor: 1
> #title_factor: 100
> heading_factor_1: 10
> heading_factor_2: 5
> heading_factor_3: 4
> #heading_factor_4: 1
> #heading_factor_5: 1
> #heading_factor_6: 0
>
>
> Any suggestions?  (We're just about ready to give up indexing .pdf and .doc
> files altogether.)
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>


