Hi Ferran,

On Thursday 27 May 2010 12:40:13, Ferran Jorba wrote:
> > On Thursday 27 May 2010 11:01:45, Ferran Jorba wrote:
> >> a few weeks ago, I found out that html2text is inadequate to create
> >> plain text from HTML, because it only knows about iso-8859-1.  I suggest
> >> lynx instead, as I explained in this mail (attached fragment).
> >> 
> >> As I haven't seen any news about this, maybe I should create a ticket
> >> for it.  But I don't know where, because I'm not up to date about
> >> your trac migration.
> >> 
> >> Should I do it myself?  Your guidance will be appreciated.
> > 
> > actually in the current master, the text extraction and other file
> > conversions have been reimplemented from scratch via the new websubmit
> > converter tool library.
> 
> And if a document does not come in via websubmit?  How is it
> handled?  In our workflow, where we (try to) collect documents in many
> different ways, websubmit is just one of them, and for html documents it
> is never used.  We use a home-made, web-based wget wrapper.
> 
> In 0.99.1 this conversion is part of bibindex.  Are you perhaps expecting
> us to create the .txt file alongside the original one?

Not at all :-) Although the library resides in WebSubmit, it is the new central
place that the whole of Invenio will use to perform conversions. WebSubmit is
one of the main users (as conversions are typically part of a submission
workflow), but BibIndex in Git already uses the library for the exact purpose
of extracting text (for a given bibdoc, there is even a heuristic and a config
variable for choosing which of the available formats the text should be
extracted from). Moreover, the extracted text is stored on the filesystem in a
versioned way, so that it needs to be extracted only once and can later be
reused by BibIndex, BibClassify, Refextract, ... (In BibIndex, it will also be
used to extract text from remote URLs in case you are not storing all the
documents locally in bibdocs.)
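
To make the "extract once, reuse later" idea concrete, here is a minimal
sketch in Python. It is not the actual Invenio code: the names
PREFERRED_FORMATS, choose_format, cached_text_path and get_text, the cache
file naming, and the preference order are all made up for illustration, and
the real heuristic and config variable may work differently.

```python
import os

# Hypothetical preference order; the real heuristic and config variable
# in Invenio are not reproduced here.
PREFERRED_FORMATS = [".txt", ".html", ".pdf"]


def choose_format(available):
    """Return the most preferred format among those available, or None."""
    for fmt in PREFERRED_FORMATS:
        if fmt in available:
            return fmt
    return None


def cached_text_path(cache_dir, doc_id, version):
    """Build a per-version cache file name (naming scheme is an assumption)."""
    return os.path.join(cache_dir, "%s_v%d.txt" % (doc_id, version))


def get_text(cache_dir, doc_id, version, extractor):
    """Return the text of one document version, extracting it at most once.

    `extractor` is any callable returning the extracted text; it is only
    invoked when no cached copy exists for this (doc_id, version) pair.
    """
    path = cached_text_path(cache_dir, doc_id, version)
    if not os.path.exists(path):
        text = extractor()
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

With something along these lines, an indexer and a classifier could both call
get_text() for the same document version, and only the first call would pay
for the conversion; a new version of the document gets its own cache file.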

> I've just taken a look at the source.  I guess it is
> http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=blob;f=modules/websubmit/lib/websubmit_file_converter.py;h=ab2415e7610eda852ed61864ce537de1fdbaae9a;hb=HEAD#l771
> 
> It looks simple and probably right.  I'll try to isolate it and test it
> myself as a standalone tool.

Great! Let us know about your results, so that we can improve it if needed.

Best regards,
        Samuele

-- 
Samuele Kaplun ** CERN Document Server ** <http://cds.cern.ch/>
