Re: [htdig] Re: problem with pdf files

Gilles Detillieux Tue, 18 Dec 2001 14:04:09 -0800

According to COLLINEAU Franck FTRD/DMI/TAM:
> In fact I have a problem with doc2html.pl. If I run it manually on a
> scanned PDF file, I get output like this (end of file):
> 
> [... binary stream data from the PDF omitted ...]
> endstream endobj 1 0 obj &lt;&lt; /Type
> /Pages /Kids [ 5 0 R ] /Count 1 &gt;&gt; endobj 2 0 obj &lt;&lt; /ModDate
> (D:20000927100335+02'00') /CreationDate (D:20000927100326+02'00') /Creator
> (Acrobat 3.0 Scan Plug-in ) /Producer (Acrobat 3.0 Scan Plug-in ) &gt;&gt;
> endobj xref 0 3 0000000000 65535 f
> 0000066182 00000 n
> 0000066246 00000 n
> trailer &lt;&lt; /Size 3
> /ID[&lt;1d60b14e3ac8285779f86361eb9f59b5&gt;&lt;1d60b14e3ac8285779f86361eb9f
> 59b5&gt;] &gt;&gt; startxref 173 %%EOF
> </PRE>
> </BODY>
> </HTML>
> 
> 1) In the doc2html.pl script I have set $PDF2HTML to either pdf2html.pl or
> pdftotext. I have configured pdf2html.pl with pdftotext and pdfinfo.
> The output is the same either way.
> Note: the PDF files are the result of scanning.
> If I run it on a non-scanned PDF file, I get "! UNABLE TO CONVERT".
> 
> 2) If I run pdf2html.pl manually on its own, it works with non-scanned files;
> with scanned files I get an empty HTML page.
> 
> I don't understand why doc2html.pl doesn't work when I have set it up with
> pdf2html.pl.
> Are scanned PDF files indexable?


It certainly sounds like your PDFs don't contain any indexable text.
pdf2html.pl uses pdftotext, which is part of the xpdf package, as the
conversion tool to get the text from the PDFs.  You don't happen to
mention which version of xpdf you're running.  Make sure you have the
latest version (see http://www.foolabs.com/xpdf/).  If pdftotext can't get
any usable text from your PDFs (you can try running it manually to see),
then there's not a whole lot else you can do to make use of these files.
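
For anyone who wants to run that manual check over several files, here is a
minimal sketch in Python. It assumes pdftotext from xpdf is on your PATH and
relies on its documented convention that "-" as the output file means stdout.
The function names and the 20-character threshold are only illustrative
assumptions, not anything ht://Dig or xpdf defines.

```python
import subprocess


def extract_text(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on one file and return whatever text it writes to stdout."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"],  # "-" asks pdftotext to write to stdout
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode("utf-8", "replace").strip())
    return result.stdout.decode("utf-8", "replace")


def count_word_chars(text):
    """Count letters and digits, ignoring whitespace, form feeds and punctuation."""
    return sum(1 for ch in text if ch.isalnum())


def looks_indexable(text, min_chars=20):
    """Guess whether the extracted text is worth indexing.

    min_chars is an arbitrary cutoff chosen for this sketch; a scanned,
    image-only PDF typically yields nothing but whitespace and form feeds.
    """
    return count_word_chars(text) >= min_chars
```

Calling looks_indexable(extract_text("scan.pdf")) on one of the scanned files
(the file name here is just a placeholder) should return False if pdftotext
finds no text in it, which would confirm that the converter scripts are not
the problem.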

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
