Hello Samuele,

> On Thursday 27 May 2010 11:01:45, Ferran Jorba wrote:
>> a few weeks ago, I found out that html2text is inadequate to create
>> plain text from HTML, because it only knows about iso-8859-1.  I suggest
>> lynx instead, as I explained in this mail (attached fragment).
>> 
>> As I haven't seen any news about this, maybe I should create a ticket
>> for it.  But I don't know where, because I'm not up to date about
>> your trac migration.
>> 
>> Should I do it myself?  Your guidance will be appreciated.
>
> actually in the current master, the text extraction and other file
> conversions have been reimplemented from scratch via the new websubmit
> converter tool library.

And what if a document does not come in via websubmit?  How is it
handled then?  In our workflow, where we (try to) collect documents
through many different channels, websubmit is just one of them, and it
is never used for HTML documents.  For those we use a home-made,
web-based wget wrapper.

In 0.99.1 this conversion is part of bibindex.  Are you perhaps expecting
us to create the .txt file alongside the original one?

> The HTML to text conversion is done in-house via Python, producing UTF-8
> documents.
>
> If you take the latest GIT version you can test it yourself via:
>
> $ sudo -u www-data python /opt/cds-invenio/lib/python/invenio/websubmit_file_converter.py --convert foo.html -o foo.txt
>
> Or feel free to send us some test HTML to see if the output actually
> corresponds to what you expected.

I've just taken a look at the source.  I guess it is
http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=blob;f=modules/websubmit/lib/websubmit_file_converter.py;h=ab2415e7610eda852ed61864ce537de1fdbaae9a;hb=HEAD#l771

It looks simple and probably right.  I'll try to isolate it and test it
myself as a standalone tool.
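
To give a rough idea of what such a standalone test could look like,
here is a minimal sketch that uses only the Python 3 standard library.
It is not the code from websubmit_file_converter.py, just an
illustration of the same idea (decode with an explicit charset, strip
the markup, write UTF-8).  The names TextExtractor, html_to_text and
convert_file are made up for this example, and the input encoding is
assumed to be known in advance rather than detected.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text from an HTML document."""

    SKIPPED = {"script", "style"}  # contents of these tags are dropped
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4"}

    def __init__(self):
        # convert_charrefs=True turns &eacute; and friends into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            # keep words of adjacent blocks from running together
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # collapse every run of whitespace into a single space
        return " ".join("".join(self._chunks).split())


def html_to_text(raw, encoding="utf-8"):
    """Decode raw HTML bytes with the given charset and return a str."""
    parser = TextExtractor()
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.text()


def convert_file(src, dst, encoding="utf-8"):
    """Read src (HTML in the given charset) and write dst as UTF-8 text."""
    with open(src, "rb") as fh:
        text = html_to_text(fh.read(), encoding)
    with open(dst, "w", encoding="utf-8") as fh:
        fh.write(text + "\n")
```

Calling convert_file("foo.html", "foo.txt", encoding="iso-8859-1")
would then produce a UTF-8 text file whatever the source charset, which
is exactly the point where html2text falls short.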

> Best regards,
>       Samuele
>
> P.s. The final goal of the converter tools library is to actually
> implement a plugin framework (not yet in GIT), so that you will be
> able to drop in any kind of conversion by writing extremely easy
> plugins (actually you will be able to use most of the existing CLI
> tools by simply adding a line in a config file)

We'll wait and see.

Thanks,

Ferran
