According to COLLINEAUFranckFTRD/DMI/TAM:
> I have downloaded the 0.93 version of Xpdftotext.
> If I run pdftotext manually, it works.
> If I run pdf2html.pl manually, it works.

You say it works, but does the text it produces make sense?

> But if I run doc2html.pl manually, I get the message: UNABLE TO CONVERT !

When you run doc2html.pl from the command line, you need to give it
the content type as the second argument, and should probably also give
it the URL as the third argument.  E.g.:

    /path/to/doc2html.pl /dir/to/my/file.pdf application/pdf http://foo/file.pdf

Theoretically this should give the same output as pdf2html.pl, because
by default I believe doc2html.pl calls pdf2html.pl to parse PDFs.  This
output should be pretty much the same as what pdftotext -raw puts out,
but wrapped in simple HTML.

The important thing is to determine the cause of the gibberish text.
Is it pdftotext, the script, or htdig?  The only way to know is to try
each piece in isolation.  Your previous e-mail suggested that you had
been running doc2html.pl manually and it was giving gibberish output.

> From: Gilles Detillieux [mailto:[EMAIL PROTECTED]] ...
> According to COLLINEAUFranckFTRD/DMI/TAM:
> > In fact I have a problem with doc2html.pl. If I run it manually with a
> > scanned PDF file, I get output something like this (end of file):
> >
> > [... unreadable binary stream data ...]
> > endstream endobj 1 0 obj << /Type /Pages /Kids [ 5 0 R ] /Count 1 >>
> > endobj 2 0 obj << /ModDate (D:20000927100335+02'00') /CreationDate
> > (D:20000927100326+02'00') /Creator (Acrobat 3.0 Scan Plug-in )
> > /Producer (Acrobat 3.0 Scan Plug-in ) >> endobj xref 0 3
> > 0000000000 65535 f
> > 0000066182 00000 n
> > 0000066246 00000 n
> > trailer << /Size 3
> > /ID[<1d60b14e3ac8285779f86361eb9f59b5><1d60b14e3ac8285779f86361eb9f59b5>]
> > >> startxref 173 %%EOF
> > </PRE>
> > </BODY>
> > </HTML>
> >
> > 1) In the doc2html.pl script I have set $PDF2HTML to either pdf2html.pl
> > or the pdftotext script. I have set up pdf2html.pl with pdftotext and
> > pdfinfo. The outputs are the same.
> > Note: the PDF files are the result of scanning.
> > If I run it with non-scanned PDF files, I get "! UNABLE TO CONVERT".
> >
> > 2) If I run pdf2html.pl alone manually, it works with non-scanned files;
> > with scanned files I get an empty HTML page.
> >
> > I don't understand why doc2html.pl doesn't work when I have set it up
> > with pdf2html.pl.
> > Are scanned PDF files indexable?
>
> It certainly sounds like your PDFs don't contain any indexable text.
> pdf2html.pl uses pdftotext, which is part of the xpdf package, as the
> conversion tool to get the text from the PDFs.  You don't happen to
> mention which version of xpdf you're running.  Make sure you have the
> latest version (see http://www.foolabs.com/xpdf/).  If pdftotext can't get
> any usable text from your PDFs (you can try running it manually to see),
> then there's not a whole lot else you can do to make use of these files.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
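
P.S. The "try each piece in isolation" advice above could be sketched
roughly as the shell helper below.  It is only an illustration, not part
of ht://Dig or xpdf: the function names printable_ratio and check_stage
are made up for this example, and the "percentage of printable bytes"
measure is just a crude stand-in for "does the text make sense".  Note
that it counts accented (non-ASCII) characters as non-printable, so
genuine French text will score somewhat below 100.

```shell
#!/bin/sh
# Hypothetical helpers for checking each conversion stage on its own.

# printable_ratio FILE
# Print the percentage (0-100) of bytes in FILE that are printable ASCII
# or whitespace.  An empty file reports 0.
printable_ratio() {
    total=$(( $(wc -c < "$1") ))
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    # Delete every byte that is NOT printable or whitespace, count the rest.
    printable=$(( $(LC_ALL=C tr -cd '[:print:][:space:]' < "$1" | wc -c) ))
    echo $(( printable * 100 / total ))
}

# check_stage LABEL COMMAND [ARGS...]
# Run one command, capture its standard output in a temporary file, and
# report how much of that output looks like text and how long it is.
check_stage() {
    label=$1
    shift
    out=$(mktemp) || return 1
    "$@" > "$out" 2>/dev/null
    echo "$label: $(printable_ratio "$out")% printable, $(( $(wc -c < "$out") )) bytes"
    rm -f "$out"
}
```

With those functions loaded, each stage could be checked like this
(assuming your pdftotext accepts "-" to write to standard output, and
that pdf2html.pl takes the file name as its argument; adjust the paths
to your own installation):

    check_stage pdftotext pdftotext -raw /dir/to/my/file.pdf -
    check_stage pdf2html  /path/to/pdf2html.pl /dir/to/my/file.pdf
    check_stage doc2html  /path/to/doc2html.pl /dir/to/my/file.pdf application/pdf http://foo/file.pdf

A stage that reports 0 bytes produced nothing at all, while a low
percentage suggests that stage is where the gibberish comes from.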

