Re: [htdig] Re: problem with pdf files

Gilles Detillieux Thu, 20 Dec 2001 12:37:35 -0800

According to COLLINEAUFranckFTRD/DMI/TAM:
> I have downloaded the 0.93 version of Xpdftotext.
> If I run pdftotext manually, it works.
> If I run pdf2html.pl manually, it works.


You say it works, but does the text it produces make sense?

> But if I run doc2html.pl manually, I get the message: UNABLE TO CONVERT !

When you run doc2html.pl from the command line, you need to give it
the content type as the second argument, and should probably also give
it the URL as the third argument.  E.g.:

/path/to/doc2html.pl /dir/to/my/file.pdf application/pdf http://foo/file.pdf

Theoretically this should give the same output as pdf2html.pl, because
by default I believe doc2html.pl calls pdf2html.pl to parse PDFs.  This
output should be pretty much the same as what pdftotext -raw puts out,
but wrapped in simple HTML.
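
If you want to check that, here is a rough sketch of a filter that strips
the HTML wrapping so the two outputs can be compared with diff.  The name
strip_html is made up for illustration; it is not part of doc2html or xpdf.
It only handles tags that open and close on the same line, plus the three
basic entities, so treat it as a starting point rather than a real parser.

```shell
#!/bin/sh
# strip_html: hypothetical helper, not part of doc2html or xpdf.
# Reads HTML on stdin and writes rough plain text on stdout.
strip_html() {
    # Remove anything that looks like a tag, then decode &lt; &gt; &amp;.
    # &amp; is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}
```

For example, pipe the doc2html.pl command above through strip_html into
one file, run pdftotext -raw with "-" as the output file name into another,
and diff the two.  Small differences in whitespace or header lines would
not be surprising even when everything is working.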

The important thing is to determine what is the cause of the gibberish
text.  Is it pdftotext, the script, or htdig?  The only way to know is
to try each piece in isolation.  Your previous e-mail suggested that
you had been running doc2html.pl manually and it was giving gibberish
output.
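
To make "try each piece in isolation" a bit more concrete, here is a
minimal sketch of a shell helper that runs one command and reports how
much of its output is printable ASCII.  The function names and the 90%
threshold are my own assumptions, not anything that ships with htdig or
xpdf, and text with a lot of accented characters will score lower than
plain English, so adjust the cutoff to taste.

```shell
#!/bin/sh
# Hypothetical helpers for judging whether a conversion step emits real text.

# Print the percentage (0-100) of bytes on stdin that are printable ASCII
# or whitespace.  Empty input is reported as 0.
printable_percent() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    total=$(wc -c < "$tmp" | tr -d ' ')
    good=$(LC_ALL=C tr -cd '[:print:][:space:]' < "$tmp" | wc -c | tr -d ' ')
    rm -f "$tmp"
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((good * 100 / total))
    fi
}

# Usage: check_stage LABEL COMMAND [ARGS...]
# Runs the command, discards its stderr, and prints a one-line verdict.
check_stage() {
    label=$1
    shift
    pct=$("$@" 2>/dev/null | printable_percent)
    # 90 is an arbitrary cutoff chosen for this sketch.
    if [ "$pct" -ge 90 ]; then
        verdict="looks like text"
    else
        verdict="gibberish or empty"
    fi
    echo "$label: $pct% printable, $verdict"
}
```

You would then run it once per stage, something like this (I am assuming
here that pdf2html.pl accepts the file name as its only argument, so check
the script before relying on that):

check_stage pdftotext pdftotext -raw /dir/to/my/file.pdf -
check_stage pdf2html /path/to/pdf2html.pl /dir/to/my/file.pdf
check_stage doc2html /path/to/doc2html.pl /dir/to/my/file.pdf application/pdf http://foo/file.pdf

The first stage that reports gibberish or empty output is the one to look
at more closely.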

> From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
...
> According to COLLINEAUFranckFTRD/DMI/TAM:
> > In fact I have a problem with doc2html.pl. If I run it manually on a
> > scanned PDF file, I get output like this (end of file):
> > 
> > - �BP- �BP- �BP- �BP�����S�} � �B�;� oe$�... Z � OE� T A� T A� T A� T A�
> T
> > A� T A� GZ��o } YtT�ʤVi9�Ubԡ��"Z s�?U�8�Ts *��� ��ͣbs�s������z�\���?�
> > I���W�����M�}T��*�J Y~ ]Uk_���E�� � ����ϳ�?j^A�Z,�G- )�Z�e~�~&lt;\� ��
> > `��m ?�ߨ�|�`���B����q�_8Ye� �'L�W#��? % �?��cp�j��q !��L}#�?
> > Lm5?{�f���e�����wg~� � ��s U��u��X�֮��� Z�f�6 j;��k��n� �8�?
> "�֭l�x��ݨf
> > }?�g[�;�d~ �� ��Z� oeCO �BP- �BP- �BP- �BP- �BP- �BP- �BP- �BP- �BP- �BP-
> > �BP-�?C� ��?C��I� ��~ �F "a z*<-} �G�" 7�( Q�� � �^ K��(tm)� �@����'� �ɲ�
> > ��Ct@J?� ��9� � �6 � R� ?'g EUR��~� 2:�� EUR �1: �/� ��9:
> > ��L�K?��f/õX��Z?�j���[��R�G9EWd3-�j�&gt;�u�.�">O[�?�
> > ��V<nHh�)�!Y��9�!Y�[]�s��OE� ��V�f ����x���?�-��m5
> > T��a-&gt;����-)�">e��Vk5�c�V��L� �|�gz�+���p�?
> > G�b�fG�TW<��f+�_Vk�-�2=EWd�-n ��h���ߣ+�(tm)�j�9�">K]� � ����2FWd��y&lt;��
> > �޽D ��Gt@J � )y� �O_� x?���z&gt; endstream endobj 1 0 obj &lt;&lt; /Type
> > /Pages /Kids [ 5 0 R ] /Count 1 &gt;&gt; endobj 2 0 obj &lt;&lt; /ModDate
> > (D:20000927100335+02'00') /CreationDate (D:20000927100326+02'00') /Creator
> > (Acrobat 3.0 Scan Plug-in ) /Producer (Acrobat 3.0 Scan Plug-in ) &gt;&gt;
> > endobj xref 0 3 0000000000 65535 f
> > 0000066182 00000 n
> > 0000066246 00000 n
> > trailer &lt;&lt; /Size 3
> > /ID[&lt;1d60b14e3ac8285779f86361eb9f59b5&gt;&lt;1d60b14e3ac8285779f86361eb9f59b5&gt;]
> > &gt;&gt; startxref 173 %%EOF
> > </PRE>
> > </BODY>
> > </HTML>
> > 
> > 1) In the doc2html.pl script I have tried setting $PDF2HTML to either
> > pdf2html.pl or pdftotext. I have configured pdf2html.pl to use pdftotext
> > and pdfinfo. The outputs are the same.
> > Note: the PDF files are the result of scanning.
> > If I run it on a non-scanned PDF file, I get "! UNABLE TO CONVERT".
> > 
> > 2) If I run pdf2html.pl manually on its own, it works with non-scanned
> > files; with scanned files I get an empty HTML page.
> > 
> > I don't understand why doc2html.pl doesn't work when I have set it up to
> > use pdf2html.pl.
> > Are scanned PDF files indexable?
> 
> It certainly sounds like your PDFs don't contain any indexable text.
> pdf2html.pl uses pdftotext, which is part of the xpdf package, as the
> conversion tool to get the text from the PDFs.  You don't happen to
> mention which version of xpdf you're running.  Make sure you have the
> latest version (see http://www.foolabs.com/xpdf/).  If pdftotext can't get
> any usable text from your PDFs (you can try running it manually to see),
> then there's not a whole lot else you can do to make use of these files.


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
