I haven't seen any solid responses come across the wire, and I suspect there isn't a product or package that will do exactly what you want.

<blatant_self_promotion>
However, our company's product, PDFTextStream, does a phenomenal job of extracting text and metadata from PDF documents. It's crazy-fast, has a clean API, and in general gets the job done very nicely. It presents two points of compromise from your ideal situation:


1. It only produces text, so you would have to take the text it provides and write it out as RTF yourself (there are tons of packages and tools that do this). Since the RTF format has pretty weak formatting capabilities compared to PDF (and even compared to HTML+CSS), you'd likely never reproduce the original layout/content of the source document anyway.

2. It is a Java library. You indicated in a later message that you were aiming to use a Python package if possible, just out of personal preference. Assuming no such package exists and you are able to introduce a Java component to your project, this would become a non-issue.
</blatant_self_promotion>
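
For the "write it out as RTF yourself" step in point 1, here is a rough sketch in Python of what that could look like, assuming the extracted text is already in hand as an ordinary string. The function name text_to_rtf and the font choice are my own placeholders, not part of PDFTextStream or any other package, and the output is deliberately bare: one font, one size, paragraph breaks only.

```python
import struct


def text_to_rtf(text):
    """Wrap plain text in a minimal RTF document (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax, so escape them.
            out.append("\\" + ch)
        elif ch == "\n":
            # Each newline becomes a paragraph break.
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # Non-ASCII: emit \uN? per UTF-16 code unit, where N is a
            # signed 16-bit decimal and '?' is the fallback character.
            raw = ch.encode("utf-16-le")
            for unit in struct.unpack("<%dH" % (len(raw) // 2), raw):
                if unit > 32767:
                    unit -= 65536
                out.append("\\u%d?" % unit)
    # Assumed header: ANSI charset, a single default font, 12pt text.
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + "".join(out) + "}"


if __name__ == "__main__":
    with open("out.rtf", "w") as f:
        f.write(text_to_rtf("First paragraph.\nSecond, with {braces}."))
```

Anything beyond this (bold, tables, images) would need real RTF control words, which is where one of the existing packages would earn its keep.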


Let me know what your questions are.

Chas Emerick
Snowtide Informatics Systems

PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/


Alexander Straschil wrote:
Hello!

I have to convert an HTML document to rtf with python, was just googling
for an hour and did find nothing ;-(
Has anybody an Idea how to convert (under Linux) an HTML or Pdf Document
to Rtf?


Thanks, AXEL

