Re: [htdig] Deleted, no excerpt with pdf files

David Adams Mon, 04 Mar 2002 03:23:33 -0800

Try running doc2html.pl from the command line:

/opt/www/htdig/bin/doc2html.pl filename.pdf application/pdf

where filename.pdf is the full path name of a PDF document.
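
If it helps to repeat that check for several files, here is a minimal shell sketch of the same test. The function name check_converter is made up for illustration and is not part of ht://Dig. It also assumes that a working converter prints HTML containing an "<html" tag on standard output, which is a guess about the output rather than something confirmed in this thread.

```shell
# Hypothetical helper (not part of ht://Dig): run a converter by hand
# with a document and a MIME type, then report what came back.
check_converter() {
    converter=$1   # e.g. /opt/www/htdig/bin/doc2html.pl
    document=$2    # full path name of the document to convert
    mime_type=$3   # e.g. application/pdf

    status=0
    # Capture standard output only; error messages are discarded here.
    output=$("$converter" "$document" "$mime_type" 2>/dev/null) || status=$?

    if [ "$status" -ne 0 ]; then
        echo "converter exited with status $status"
        return 1
    fi
    if [ -z "$output" ]; then
        echo "converter produced no output"
        return 1
    fi
    # Assumption: HTML output contains an <html tag in some letter case.
    if printf '%s\n' "$output" | grep -qi '<html'; then
        echo "converter produced HTML output"
    else
        echo "converter produced output without an <html tag"
    fi
    return 0
}

# Example call, using the path from the command above:
#   check_converter /opt/www/htdig/bin/doc2html.pl /full/path/to/filename.pdf application/pdf
```

An empty result or a non-zero exit status would point at the converter or the tools it calls rather than at htdig itself.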

--
David Adams
Computing Services
Southampton University

----- Original Message -----

From: Steve Marshall

To: [EMAIL PROTECTED]

Sent: Monday, March 04, 2002 10:08 AM

Subject: [htdig] Deleted, no excerpt with pdf files

htDig is working fine for us on a large intranet (2 GB or so) that is entirely graphics and .html. I want to index PDFs too, of course.

I am running the doc2html.pl script on a very simple test index.html file that links only to a .GIF and a small .pdf file. (I have tried parse_doc and conv_doc too.)

I have the latest XPDF, and pdftotext works fine on the same .pdf at the command line, producing a perfect .txt file.

When I run htDig with the -vvvvv option, it lists all the lines in that .pdf file as plain text, so it is apparently parsing properly.

However, when I try to htmerge I get a "Deleted, no excerpt" message. The wordlist file is tiny.

I can see from an earlier response that the problem might be that the parser hasn't emitted a usable "h" record. How would I go about fixing that? Would this apply to a .txt file? The test output hasn't got any tags (of course).

This is the only relevant uncommented line in htdig.conf:

external parsers application/pdf->text/html /opt/www/htdig/bin/doc2html.pl

Any help gratefully appreciated

Steve Marshall