Re: [htdig] pdf and doc hits sorted first in htsearch results?

Ted Stresen-Reuter Mon, 17 Feb 2003 05:58:26 -0800

I too had similar problems way back in October of 2002. The problem _appeared_ to go away when I went back to version 3.1.6 (I had been using 3.2.xxx). I say it appeared to go away because there are times when Word/PDF files still end up unusually high in the ranking for no apparent reason.

If you can, try installing 3.1.6 and run some tests. Also, before doing the installation, you might try sending the output from sample searches to text files and then compare them against similar searches in 3.1.6 to try and track down where the discrepancies are.
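A minimal sketch of that comparison, assuming each saved result file has been reduced to one URL per line in ranked order. The helper names and the file layout are hypothetical, not something htsearch produces by itself:

```python
def load_urls(path):
    """Read a saved result list: one URL per line, best match first (assumed layout)."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def rank_changes(old_urls, new_urls):
    """Map each URL found in both lists to (old_rank, new_rank) when its rank moved."""
    old_rank = {url: i for i, url in enumerate(old_urls, start=1)}
    changes = {}
    for i, url in enumerate(new_urls, start=1):
        if url in old_rank and old_rank[url] != i:
            changes[url] = (old_rank[url], i)
    return changes


def only_in_one(old_urls, new_urls):
    """Return (URLs missing from the new list, URLs missing from the old list)."""
    old_set, new_set = set(old_urls), set(new_urls)
    missing_in_new = [u for u in old_urls if u not in new_set]
    missing_in_old = [u for u in new_urls if u not in old_set]
    return missing_in_new, missing_in_old
```

Running rank_changes(load_urls(a), load_urls(b)) on the same query saved from each version would show which documents moved, which makes it easier to see whether the .pdf and .doc hits are the ones jumping ahead.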

According to the notes from this thread in October, you might try running your HTML against some HTML validation service (make sure all tags are properly closed and such). If a tag is left open, it might not be getting read by htdig (this is just a guess; you'll have to check with the developers to see if this is, in fact, a possibility).
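If a validation service is not handy, a rough local check is possible. The sketch below uses Python's standard html.parser to flag tags that are opened but never closed. It is only a heuristic and no substitute for a real validator: HTML allows some end tags (for example </p> and </li>) to be omitted, so treat its report as a hint. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags as (name, line number)
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Look for the nearest open tag with the same name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                # Everything opened after it was never closed.
                for name, line in self.stack[i + 1:]:
                    self.problems.append(
                        f"<{name}> opened on line {line} was not closed")
                del self.stack[i:]
                return
        self.problems.append(f"stray </{tag}> on line {self.getpos()[0]}")


def find_unclosed(html_text):
    """Return a list of messages about unclosed or stray tags."""
    finder = UnclosedTagFinder()
    finder.feed(html_text)
    finder.close()
    # Tags still open at end of input were never closed either.
    for name, line in finder.stack:
        finder.problems.append(f"<{name}> opened on line {line} was not closed")
    return finder.problems
```

Whether an unclosed tag actually stops htdig from reading part of a page is still the open question raised above; this only helps find candidate pages to test with.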

Good luck.

Ted Stresen-Reuter

On Monday, February 17, 2003, at 05:45 AM, David Adams wrote:

You have

keywords_factor: 500

Is it possible that the authors of the PDF documents have been diligent in
setting the keywords whilst the authors of the HTML pages have not bothered?

Just a thought.
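One way to test this idea on the HTML side is to check which pages actually declare a keywords meta tag. The sketch below, using Python's standard html.parser, extracts the keywords from a page's source. The function name is hypothetical, and it says nothing about what doc2html emits for PDF or Word files, which would need to be checked separately.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # attribute names arrive lowercased
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Keywords are assumed to be comma-separated here.
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip())


def meta_keywords(html_text):
    """Return the keywords declared in <meta name="keywords"> tags, if any."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords
```

If most HTML pages come back with an empty list while the converted PDF and Word documents carry keywords, a keywords_factor of 500 would be a plausible explanation for the ordering.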

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Michael Boer" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, February 14, 2003 11:18 PM
Subject: [htdig] pdf and doc hits sorted first in htsearch results?

We are running ht://Dig 3.2.0b4-011302 on a Red Hat 7.3 system, installed from
the standard Red Hat RPMs. We have been using doc2html to parse PDFs and DOCs,
with the following lines at the end of /etc/htdig.conf:

external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
                   application/postscript->text/html /usr/local/bin/doc2html.pl \
                   application/pdf->text/html /usr/local/bin/doc2html.pl

The mystery is: How can we get htsearch to stop bunching all the .pdf and .doc
files at the top of the results? For reasons unclear to me, all matching .pdf
files are listed, then all the .doc files, and then all the .html files.

Our search algorithm and weighting factors are like this:

search_algorithm: exact:1 synonyms:0.2 endings:0.1

#backlink_factor: 1000.0
#date_factor: 0.00
#description_factor: 150
#heading_factor: 5.0
keywords_factor: 500
meta_description_factor: 100
#text_factor: 1
#title_factor: 100
heading_factor_1: 10
heading_factor_2: 5
heading_factor_3: 4
#heading_factor_4: 1
#heading_factor_5: 1
#heading_factor_6: 0

Any suggestions? (We're just about ready to give up indexing .pdf and .doc
files altogether.)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
