Webboard: Titles incorrect for pdf files

Richard Wall Wed, 25 Apr 2001 04:01:47 -0700
Author: Richard Wall
Email: [EMAIL PROTECTED]
Message:
I have the same problem. I have been using 'pdftotext', but the text files it 
produces contain no meta info, so mnogosearch is never going to extract meaningful 
titles from them. The solution, I think, is to convert the pdf file to html. 
One such tool is 'pdftohtml'...
http://www.ra.informatik.uni-stuttgart.de/~gosho/pdftohtml/

But this isn't any good either, because it doesn't extract the title and keywords. It 
just titles the output pages 'index1', 'index2', etc.

The HtDig mailing lists recommend using a program called 'doc2html'...
http://www.htdig.org/files/contrib/parsers/

This is a Perl wrapper script which converts PDF, MS Word, WordPerfect and other 
formats to html. 

When converting pdf files, it uses pdfinfo to extract meta data (title, keywords, 
etc.) from the pdf file and generate the html <head> info, and pdftotext to generate 
the html <body> info. Both these programs are in the xpdf suite...
http://www.foolabs.com/xpdf/xpdf.html
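
To make the idea concrete, here is a rough Python sketch of the same approach. It 
is not doc2html's actual code (that is a Perl script); the function names are made 
up for illustration. It assumes pdfinfo prints 'Key: value' lines and that 
'pdftotext file.pdf -' writes the text to stdout, and that both are on the PATH.

```python
import html
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key: value' lines into a dict (hypothetical helper)."""
    meta = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_html(meta, body_text):
    """Build a minimal html page: meta data in <head>, extracted text in <body>."""
    title = html.escape(meta.get("Title", ""))
    keywords = html.escape(meta.get("Keywords", ""))
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        f'<meta name="keywords" content="{keywords}">\n'
        "</head>\n<body>\n"
        f"<pre>{html.escape(body_text)}</pre>\n"
        "</body>\n</html>\n"
    )


def pdf_to_html(pdf_path):
    """Run pdfinfo and pdftotext on one file and combine their output."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return build_html(parse_pdfinfo(info), text)
```

Calling pdf_to_html("report.pdf") (file name is only an example) would return one 
html page whose <title> comes from the pdf's own Title field, which is what an 
indexer needs in order to show a meaningful title.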

So doc2html seems like the ideal solution, BUT I can't get it to work. Has anyone 
else got any tips on how to use it?

Richard Wall
CWN Web Developer
URL: http://www.cwn.org.uk
PGP Key ID: 0xAC33A456


Reply: <http://search.mnogo.ru/board/message.php?id=2099>
