Hi Ferran,

On Thursday 27 May 2010 11:01:45, Ferran Jorba wrote:
> a few weeks ago, I found out that html2text is inadequate to create
> plain text from HTML, because it only knows about iso-8859-1.  I suggest
> lynx instead, as I explained in this mail (attached fragment).
> 
> As I haven't seen any news about this, maybe I should create a ticket
> for it.  But I don't know where, because I'm not up to date about your
> trac migration.
> 
> Should I do it myself?  Your guidance will be appreciated.

Actually, in the current master, text extraction and the other file 
conversions have been reimplemented from scratch via the new websubmit 
converter tool library. 

The HTML-to-text conversion is now done in-house in Python and produces 
UTF-8 documents. 
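
To give an idea of what such a conversion involves, here is a minimal sketch 
using only the Python standard library. It is not the code in 
websubmit_file_converter.py; the names TextExtractor and html_to_utf8_text 
are made up for this illustration, and it assumes the caller knows the 
encoding of the input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_utf8_text(raw, source_encoding="utf-8"):
    """Decode raw HTML bytes, strip the markup, return UTF-8 encoded text.

    Hypothetical helper: source_encoding is supplied by the caller,
    it is not detected from the document.
    """
    parser = TextExtractor()
    parser.feed(raw.decode(source_encoding, errors="replace"))
    parser.close()
    # Collapse runs of whitespace so the output is one clean line of text.
    text = " ".join("".join(parser.chunks).split())
    return text.encode("utf-8")
```

The point is that the input is decoded to Unicode first and only encoded to 
UTF-8 at the very end, so an iso-8859-1 page and a UTF-8 page go through the 
same path.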

If you take the latest GIT version you can test it yourself via:

$ sudo -u www-data python /opt/cds-invenio/lib/python/invenio/websubmit_file_converter.py --convert foo.html -o foo.txt

Or feel free to send us some test HTML, so we can check whether the output 
actually corresponds to what you expect.

Best regards,
        Samuele

P.S. The final goal of the converter tools library is to implement a plugin 
framework (not yet in Git), so that you will be able to drop in any kind of 
conversion by writing very simple plugins. In fact, you will be able to use 
most of the existing CLI tools by simply adding a line to a config file.
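
Since the framework is not written yet, the following is only a guess at 
what "one line in a config file" could look like, not the actual design. The 
line format, the functions parse_converter_line and build_command, and the 
tool name "mytool" are all hypothetical.

```python
import shlex


def parse_converter_line(line):
    """Parse a hypothetical 'src_ext -> dst_ext : command template' line."""
    formats, template = line.split(":", 1)
    src_ext, dst_ext = (part.strip() for part in formats.split("->"))
    return src_ext, dst_ext, template.strip()


def build_command(template, src, dst):
    """Turn a command template into an argument list for subprocess.

    The template is split first and the placeholders are filled in
    afterwards, so a file name containing spaces stays a single argument.
    """
    return [arg.format(src=src, dst=dst) for arg in shlex.split(template)]


# "mytool" is a placeholder, not a real converter.
line = "html -> txt : mytool --in {src} --out {dst}"
src_ext, dst_ext, template = parse_converter_line(line)
command = build_command(template, "foo.html", "foo.txt")
```

The resulting list could then be handed to subprocess.run(), which avoids 
going through a shell.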

-- 
Samuele Kaplun ** CERN Document Server ** <http://cds.cern.ch/>
