RE: [htdig] Error Msg when ht:Digging PDF files

Cutts III, James H. Mon, 29 Nov 2004 06:07:14 -0800

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf 
> Of Cutts III, James H.
> Sent: Tuesday, November 23, 2004 5:37 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [htdig] Error Msg when ht:Digging PDF files
> 
> 
> >-----Original Message-----
> >From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Milan
> >Andric
> >Sent: Monday, November 22, 2004 2:49 PM
> >To: [EMAIL PROTECTED]
> >Subject: Re: [htdig] Error Msg when ht:Digging PDF files
> >
> >
> >On Mon, Nov 22, 2004 at 02:32:05PM -0600, Cutts III, James H. wrote:
> >> I am slowly working my way through the process of getting PDF files
> >> to be indexed by ht://Dig.  I've found and installed xpdf 2.01-11
> >> and verified that pdftotext works.  I've installed and modified
> >> doc2html.pl.  I've modified the pdf2html.pl file.  And I've created
> >> an HTML file that points to my PDF files and tweaked my htdig.conf
> >> to include the external_parsers: command.
> >> 
> >> I run htdig -vv -i -c htdig.conf and I get the following errors
> >> 
> >> External parser error: can't parse Content-Type "txt/html"
> >>  URL:
> >>
> http://128.206.75.187/cori_kbase_jhc/pdfs/missouri_hmo/Commuity%20Care
> >> Pl
> >> us-Hospitals%20Expansion%202-99pc.pdf
> >> 
> >> Once for each pdf file.
> >> 
> >> Any suggestions?  The file displays nicely in a web browser.  I
> >> suspect that it may be the setup of the Apache server and the mime 
> >> type that it's sending.
> >
> >the default content-type header for html/apache is
> >Content-Type: text/html; charset=ISO-8859-1
> >
> >for pdf you probably want to use
> >Content-Type: application/pdf
> >
> >this should happen automatically with mime_module apache module. the
> >mime.types file by default should contain
> >application/pdf                      pdf
> >
> >some browsers will figure out by the file extension how to open a
> >pdf file.
> >
> >"Content-Type: txt/html" header is just wrong, i think.
> >
> >maybe if you fix the header, it should work better.
> >
> >--
> >Milan
> >
> >
> 
> Milan, 
> 
> Thanks for the suggestion.  That was what I had thought.  I
> worked carefully through my Apache configuration files and
> they certainly looked correct.  I then started poking at the
> doc2html.pl script, since I could add debugging statements to it.
> My testing version of the doc2html.pl script is now quite
> verbose.  Here is what I've identified.
> 
> 1. htDig calls Apache for a page.
> 2. Apache returns the page to htDig.
>    (I have htDig configured for a maximum size larger than the
>    largest .PDF in my test collection, so it's not an issue of
>    the parser choking on a partial PDF file.)
> 3. htDig identifies the page as a PDF file.
> 4. htDig passes the page to the external parser.
>    (I've correctly configured htDig to use the doc2html.pl script
>    that comes with htDig.  I know this is working because I've
>    stuck debugging comments through the script and I'm able to
>    trace the execution through the script.)
> 5. doc2html.pl uses the file extension, magic code and MIME type
>    to determine the appropriate conversion utility.
>    (In this case, the conversion utility is the pdf2html.pl script
>    that also comes with htDig.  I know that this script is working
>    because I've stuck debugging comments through the script and
>    I'm able to trace the execution through the script.)
> 6. The pdf2html.pl script calls the pdftotext program to convert
>    the file from PDF to a text stream.  pdf2html.pl wraps the text
>    stream in HTML tags to be returned (eventually) to htDig.
>    (The system is dying where the pdf2html.pl script calls the
>    pdftotext application.  The pdftotext application is opened as
>    a pipe returning the results of the conversion to the
>    pdf2html.pl script.  However, it appears that the pdftotext
>    program is failing in a way that causes pdf2html.pl to abend,
>    as the error-trapping statements are not being executed.
>
>    I checked the syntax of the command being executed from
>    within pdf2html.pl at the command line.  It works perfectly,
>    converting the document nicely.
>
>    The documents I am trying to process are ones that we have
>    created ourselves with Adobe Acrobat from scanned sources.
>    They convert successfully when run from the command line.)
> 
> So that's where I've gotten to.  I've found where the system 
> is blowing up, but I can't identify why.  I've been doing my 
> testing as root. While this is not really how I'm going to 
> want to run the system in the future, it minimizes rights issues.
> 
> Any further assistance and suggestions would be much 
> appreciated.  I could rather easily rewrite the pdf2html.pl 
> script to call the pdftotext application in a different 
> manner, but I don't know if that would really solve the problem.
> 
> Thanks,
> 
> James H. Cutts III
> CORI - 143C Mumford
>



Further information for those following this interesting case.  The
doc2html.pl and pdf2html.pl scripts work correctly when called from
outside htDig.  They die only when htDig calls them as an external
parser.

I've added debugging lines that let me see that the scripts are being
terminated in the middle of pdf2html.pl processing the converted text.
Neither pdf2html.pl nor doc2html.pl finishes properly; they both just
terminate.
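
When a script "just terminates" without reaching its own error
handling, it can help to find out whether it exited with a status code
or was killed by a signal.  The following is a minimal Python sketch of
such a check, not part of ht://Dig; the function name describe_exit is
made up for illustration, and it relies only on the POSIX behaviour
that subprocess reports death by signal N as return code -N.

```python
import signal
import subprocess


def describe_exit(cmd):
    """Run cmd, capture its output, and describe how the process ended.

    Returns (kind, detail, stderr_text).  kind is "exited" (detail is the
    exit status) or "signaled" (detail is the signal name or number).
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr_text = proc.stderr.decode("utf-8", errors="replace")
    if proc.returncode < 0:
        signum = -proc.returncode
        try:
            detail = signal.Signals(signum).name
        except ValueError:
            # Signal number without a symbolic name on this platform.
            detail = str(signum)
        return "signaled", detail, stderr_text
    return "exited", proc.returncode, stderr_text
```

A call such as describe_exit(["perl", "/path/to/pdf2html.pl",
"sample.pdf"]) would be one way to use it; both paths there are
placeholders, and the arguments the script actually expects should be
taken from its own header comments.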

I was reviewing pdf2html.pl to see if I could rewrite it to replace the
doc2html.pl / pdf2html.pl combination when I (finally) noticed the
comment saying that pdf2html.pl can be called directly from htDig as an
external converter.  So I adjusted my htdig.conf to call pdf2html.pl
instead of doc2html.pl.  The results showed that the change to
htdig.conf took effect, but the abrupt termination of the pdf2html.pl
script persisted.
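
The error quoted earlier in the thread names the Content-Type
"txt/html", and "txt" is not a registered top-level MIME type (the
usual spelling is text/html).  One thing that may be worth ruling out
is a misspelled type on the external_parsers line itself.  The Python
sketch below is only an illustration of such a check: the function name
suspicious_types is hypothetical, and it assumes the attribute value is
a whitespace-separated list of (content-type, program) pairs in which a
content type may be written as "input/type->output/type".

```python
import shlex

# Top-level media types registered with IANA.
KNOWN_TOP_LEVEL = {
    "application", "audio", "example", "font", "image",
    "message", "model", "multipart", "text", "video",
}


def suspicious_types(external_parsers_value):
    """Return the MIME types in the value whose top-level type is unknown.

    Assumption: tokens alternate between a content-type spec and the
    program to run, and quoted strings may contain spaces.  Entries
    without a slash are skipped, since only the top-level type is checked.
    """
    tokens = shlex.split(external_parsers_value)
    bad = []
    for spec in tokens[0::2]:  # the first element of each pair
        for mime in spec.split("->"):
            top, sep, _sub = mime.partition("/")
            if sep and top.lower() not in KNOWN_TOP_LEVEL:
                bad.append(mime)
    return bad
```

Passing it the text that follows "external_parsers:" in htdig.conf
would return a list such as ["txt/html"] if that spelling appears
there, or an empty list if every type looks plausible.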

I wonder if htDig has a timeout for the external parser that has somehow
been set to an incredibly short time.  The only timeout I can find among
the .conf attributes is the one for the response from the web server.
Does anyone know if htDig terminates external parsers after a specific
period of time?
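
One way to gather evidence on the timeout question is to measure how
long the conversion takes when run on its own, and to see how the
script behaves when it is cut off after a fixed interval.  The
following is a minimal Python sketch under that assumption; time_command
is a made-up helper name, and nothing here reflects how htDig itself
runs its parsers.

```python
import subprocess
import time


def time_command(cmd, timeout=None):
    """Run cmd and return (seconds_elapsed, returncode).

    If timeout (in seconds) is given and the command runs longer, the
    child is killed and returncode is reported as None.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code
```

Running it on the same command line that pdf2html.pl builds for
pdftotext gives a baseline duration to compare against whatever limit
is suspected.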

Happy Thanksgiving,
James H. Cutts III

P.S.  I'm running htDig 3.2.0b4 on RedHat Linux 9 on a Dell 2600 with a
single Xeon processor.


