Richard Skelton wrote:
>
> Hi,
> Sorry if this question has been answered before as I have just found htdig.
> I have been looking for a search engine for pdf's and seem to have found the
> answer in htdig.
> My first tests were on a mix of html's and pdf's and I found that htdig
> processed my pdf's fine, but if I searched for words that were unique in my file
> junk2.pdf it did not find them.
> Today I have indexed a directory containing pdf's produced by ghostscript and
> don't seem to have a problem, so I went back to the file junk2.pdf and found it
> was produced by the program text2pdf.
> Is this the problem?
It may very well be!
It is a known problem with PDF that some files are produced with fonts
re-encoded (for a variety of reasons), so that "Hello" might indeed be
coded "Zkhha" (one character substituted by another one).
These files cannot be retrieved in a search, because it all happens in
the background and nobody knows you should search for Zkhha when you
want Hello. There's no way around that, however great your search server
is...
Nevertheless, it does not happen very often. Maybe text2pdf does
re-encode fonts. You'll have to check that (and check whether you can
override this setting).
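
To make the re-encoding problem concrete, here is a minimal Python sketch.
It is not how htdig or any PDF parser works internally: the substitution
table, the helper names (stored_text, build_index) and the file
plain.pdf are all made up for illustration, and the table is chosen only
to reproduce the "Hello" -> "Zkhha" example. It just shows that an index
built from the stored character codes never contains the visible word.

```python
# Hypothetical substitution table, chosen only to reproduce the
# "Hello" -> "Zkhha" example; a real PDF uses its own font encoding.
REENCODE = str.maketrans({"H": "Z", "e": "k", "l": "h", "o": "a"})


def stored_text(visible):
    """Return the characters as they would be stored in the file."""
    return visible.translate(REENCODE)


def build_index(documents):
    """Map each lowercased word to the set of documents containing it."""
    index = {}
    for name, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index


documents = {
    "plain.pdf": "Hello world",               # stored as displayed
    "junk2.pdf": stored_text("Hello world"),  # stored re-encoded
}
index = build_index(documents)

print(stored_text("Hello"))               # Zkhha
print(sorted(index.get("hello", set())))  # ['plain.pdf']
print(sorted(index.get("zkhha", set())))  # ['junk2.pdf']
```

A search for "hello" only returns the file whose text is stored as
displayed; the re-encoded file can only be found under "zkhha", which
nobody would think to type.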
Hope this helps.
Jacques Le Mouel