According to Franck Collineau:
> I have launched rundig -v and i have the messages below.
> When i do a search with a key word that is in the pages, it doesn't find 
> anything !!
> 
> Is my indexation good ?

Well, it doesn't sound like it indexed correctly, but it's hard to say
for sure.  The fact that htmerge didn't remove the PDF files from the
database suggests that htdig did get something from these files, but
the question is what, and how much.  To see the actual words that htdig
grabs from each document and keeps in the index, you'd need to run with
-vvvv.  That generates a lot of output, but you can then look through
it to see whether htdig is finding all the words it should.

Another thing you can do is look through your db.wordlist file to see
what words are in there.  If a search for one of the words in there
still fails to find a match, it would suggest that either htmerge isn't
correctly building the db.words.db database from db.wordlist, or htsearch
isn't correctly searching this database.  On the other hand, if htsearch
results are consistent with what you see in db.wordlist, but this file
doesn't contain all the words it should be getting from the PDFs, then
the problem is with htdig and the external parser or external converter
script you're using, or perhaps with the PDF files themselves.
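
To make the db.wordlist check less tedious than paging through the file
by hand, something like the following sketch might help.  It assumes
that each line of db.wordlist starts with the indexed word followed by
whitespace and further fields; I haven't spelled out the rest of the
record format here, so check that assumption against your own file.  The
function name wordlist_entries is just made up for this example.

```python
def wordlist_entries(path, word):
    """Return the lines of a db.wordlist-style file whose first field is `word`.

    Assumption: one record per line, with the indexed word as the first
    whitespace-separated field. The remaining fields are not interpreted.
    """
    target = word.lower()
    matches = []
    # latin-1 decodes any byte sequence, so odd characters won't abort the scan.
    with open(path, encoding="latin-1") as fh:
        for line in fh:
            fields = line.split()
            if fields and fields[0].lower() == target:
                matches.append(line.rstrip("\n"))
    return matches
```

If this returns entries for a word but htsearch finds nothing for it,
suspect htmerge or htsearch.  If it returns nothing for a word you know
is in the PDFs, suspect htdig or the converter.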

In an earlier message, you had asked about doc2html.pl, so I assume
that's what you're using.  Are you sure you've set the external_parsers
attribute correctly?  Do you get the correct output when you run
doc2html.pl manually on some of these PDF files?

> New server: r-lx-collineau.rd.francetelecom.fr, 80
> 0:0:0:http://r-lx-collineau.rd.francetelecom.fr/web/essai:  redirect
> 1:1:0:http://r-lx-collineau.rd.francetelecom.fr/web/essai/: ++++++++++++++++ size = 898
> 2:2:1:http://r-lx-collineau.rd.francetelecom.fr/web/essai/page06.pdf:  size = 84559
...
> 17:17:1:http://r-lx-collineau.rd.francetelecom.fr/web/essai/page33.pdf:  size = 145221
> htmerge: Sorting...
> htmerge: Removing doc #0
> htmerge: Merging...
> 
> Deleted, no excerpt: 0/http://r-lx-collineau.rd.francetelecom.fr/web/essai
> htmerge: 10

The only document htmerge deleted from the database is the directory
URL above (the one without the trailing slash), which caused a redirect.
This is to be expected.  The fact that none of the PDFs were deleted
suggests that htdig got an excerpt from each of these files, so they do
contain text, and the parser is finding at least some of it.
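
If you save the output of a longer run, a small script can pull out the
documents htmerge dropped so you can confirm no PDFs are among them.
This is only a sketch: the pattern is based solely on the "Deleted, no
excerpt:" line in your output above, and the function name
deleted_documents is invented for the example.

```python
import re

# Matches lines such as:
#   Deleted, no excerpt: 0/http://host/web/essai
# i.e. a numeric document ID, a slash, then the URL.
DELETED = re.compile(r"Deleted, no excerpt:\s*(\d+)/(\S+)")

def deleted_documents(log_text):
    """Return (doc_id, url) pairs for documents htmerge reported as deleted."""
    found = []
    for line in log_text.splitlines():
        match = DELETED.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found
```

Any URL ending in .pdf in the result would mean htdig got no excerpt,
and so probably no text, from that file.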

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
