Author: Richard Wall
Email: [EMAIL PROTECTED]
Message:
I have the same problem. I have been using 'pdftotext', but the text files it
produces include no meta info, so mnogosearch is never going to extract meaningful
titles. The solution, I think, is to convert the pdf file to html.
One such tool is 'pdftohtml'...
http://www.ra.informatik.uni-stuttgart.de/~gosho/pdftohtml/
But this isn't any good either, because it doesn't extract the title and keywords. It
just titles the output pages 'index1', 'index2', etc.
The HtDig mailing lists recommend using a program called 'doc2html'...
http://www.htdig.org/files/contrib/parsers/
This is a Perl wrapper script which converts pdf, msword, wordperfect and others to
html.
When converting pdf files, it uses pdfinfo to extract meta data (title, keywords,
etc.) from the pdf file and generate the html <head> info, and pdftotext to generate
the html <body> info. Both these programs are in the xpdf suite...
http://www.foolabs.com/xpdf/xpdf.html
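
To make the idea concrete, here is a rough Python sketch of the pdfinfo + pdftotext approach described above. It is not doc2html itself (that is a Perl script), just an illustration under some assumptions: pdfinfo and pdftotext are on the PATH, pdfinfo prints 'Key: value' lines, and the function names (parse_pdfinfo, build_html, pdf_to_html) are made up for this example.

```python
import html
import subprocess
import sys


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict (assumed output format)."""
    meta = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_html(meta, body_text):
    """Build a minimal HTML page: metadata in <head>, extracted text in <body>."""
    title = html.escape(meta.get("Title", ""))
    keywords = html.escape(meta.get("Keywords", ""))
    body = html.escape(body_text)
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        f'<meta name="keywords" content="{keywords}">\n'
        "</head>\n<body>\n"
        f"<pre>{body}</pre>\n"
        "</body>\n</html>\n"
    )


def pdf_to_html(pdf_path):
    """Run pdfinfo and pdftotext on one file and combine their output."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    # A '-' output name asks pdftotext to write the text to stdout.
    text = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, text=True, check=True
    ).stdout
    return build_html(parse_pdfinfo(info), text)


if __name__ == "__main__":
    # Hypothetical usage: python pdf2html_sketch.py some.pdf > some.html
    if len(sys.argv) == 2:
        sys.stdout.write(pdf_to_html(sys.argv[1]))
```

If the pdf has no Title or Keywords set, the <title> and keywords fields simply come out empty, so the indexer would still have nothing useful to show for that file.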
So doc2html seems like the ideal solution, BUT I can't get it to work. Has anyone
else got any tips on how to use it?
Richard Wall
CWN Web Developer
URL: http://www.cwn.org.uk
PGP Key ID: 0xAC33A456
Reply: <http://search.mnogo.ru/board/message.php?id=2099>