Re: [htdig] Deleted, no excerpt with pdf files

David Adams Mon, 04 Mar 2002 08:40:46 -0800

Steve,

You must:

Put the full pathname of your Perl binary in the first line of each of your Perl scripts.

Configure doc2html.pl with the full pathname of where you have installed pdf2html.pl.

Configure pdf2html.pl with the full pathnames of where you have installed pdftotext and pdfinfo.

Test pdf2html.pl at the command line:

pdf2html.pl /usr/website/pdfs/phoenix.pdf

If that works then try doc2html.pl:

/opt/www/htdig/bin/doc2html.pl /usr/website/pdfs/phoenix.pdf application/pdf

If that fails, then you still haven't configured doc2html.pl correctly.
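
The first requirement above can be checked mechanically. What follows is a minimal shell sketch, not part of doc2html itself: `check_shebang` is a hypothetical helper name, and it assumes a POSIX shell with `head` available. It reads the first line of a script and reports whether the interpreter named after `#!` exists and is executable.

```shell
#!/bin/sh
# check_shebang FILE (hypothetical helper): succeed if FILE's first line is
# "#!" followed by the path of an existing executable, e.g. /usr/bin/perl.
check_shebang() {
    script=$1
    # Read only the first line of the script; fail if it cannot be read.
    first_line=$(head -n 1 "$script") || return 2
    case $first_line in
        '#!'*) ;;
        *) echo "$script: no #! line" >&2; return 1 ;;
    esac
    interp=${first_line#??}               # drop the leading "#!"
    interp=${interp#"${interp%%[! ]*}"}   # trim spaces after "#!"
    interp=${interp%% *}                  # drop interpreter arguments such as -w
    if [ -x "$interp" ]; then
        echo "$script: interpreter $interp found"
    else
        echo "$script: interpreter $interp missing or not executable" >&2
        return 1
    fi
}
```

For example, `check_shebang /opt/www/htdig/bin/doc2html.pl`. A script also needs execute permission (`chmod +x`) before it can be run without typing `perl` in front of it.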

--
David Adams
Computing Services
Southampton University

----- Original Message -----

From: Steve Marshall

To: 'David Adams'

Sent: Monday, March 04, 2002 4:13 PM

Subject: RE: [htdig] Deleted, no excerpt with pdf files

David

Thanks for responding so quickly:-

I have doc2html.pl located in the /opt/www/htdig/bin directory

(BTW I'm not sure whether you intended "application/pdf" should be substituted)

If I leave it unchanged ("# perl /opt/www/htdig/bin/doc2html.pl /usr/website/pdfs/phoenix.pdf  application/pdf") I get

!     'UNABLE TO CONVERT'

!!    /opt/www/htdig/bin/: is a directory

with

"# perl /opt/www/htdig/bin/doc2html.pl /usr/website/pdfs/phoenix.pdf /usr/local/bin/pdftotext http://foo/file.pdf"

I get

!     'UNABLE TO CONVERT'

but

"# perl /opt/www/htdig/bin/conv_doc.pl /usr/website/pdfs/phoenix.pdf /usr/local/bin/pdftotext http://foo/file.pdf"

produces a valid html file on the console.

BTW, I actually have to type "perl" to run the script. Is that expected?

Regards

Steve Marshall

-----Original Message-----
From: David Adams [mailto:[EMAIL PROTECTED]]
Sent: 04 March 2002 11:33
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [htdig] Deleted, no excerpt with pdf files

Try running doc2html.pl from the command line:

    /opt/www/htdig/bin/doc2html.pl filename.pdf application/pdf

where filename.pdf is the full path name of a PDF document.
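
To judge the result of that test run without reading the output by eye, a small filter can help. This is only a sketch: `looks_like_html` is a hypothetical helper, and it assumes the converter writes its HTML to standard output, as the `->text/html` part of the external parser setting suggests.

```shell
#!/bin/sh
# looks_like_html (hypothetical helper): read converter output on stdin and
# succeed only if it contains an <html> or <title> tag, in any letter case.
# Empty output, or plain text with no tags, makes it fail.
looks_like_html() {
    grep -i -q -e '<html' -e '<title'
}
```

Usage would look like `/opt/www/htdig/bin/doc2html.pl filename.pdf application/pdf | looks_like_html && echo "converter produced HTML"`, with filename.pdf again being the full path name of a PDF document.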

--
David Adams
Computing Services
Southampton University

----- Original Message -----

From: Steve Marshall

To: [EMAIL PROTECTED]

Sent: Monday, March 04, 2002 10:08 AM

Subject: [htdig] Deleted, no excerpt with pdf files

htDig is working fine for us on a large intranet (2 GB or so) that is entirely graphics and .html. I want to index PDFs too, of course.

I am running the doc2html.pl script on a very simple (test) index.html file which links only to a .GIF and a small .pdf file. (I have tried parse_doc and conv_doc too.)

I have the latest XPDF, and pdftotext works fine on the same .pdf at the command line and produces a perfect .txt file.

When I run htDig with the -vvvvv option, it lists all the lines in that .pdf file as plain text, so it is apparently parsing properly.

However, when I try to htmerge I get a "Deleted, no excerpt" message. The wordlist file is tiny.

I can see from an earlier response that the problem might be that the parser hasn't emitted a usable "h" record - how would I go about fixing that? Would this apply to a .txt file - the test output hasn't got any tags (of course).

This is the only relevant uncommented line in htdig.conf

external parsers        application/pdf->text/html /opt/www/htdig/bin/doc2html.pl

Any help gratefully appreciated

Steve Marshall

________________________________________________________________________
This e-mail has been scanned for all viruses by Star Internet. The
service is powered by MessageLabs. For more information on a proactive
anti-virus service working around the clock, around the globe, visit:
http://www.star.net.uk
________________________________________________________________________

