Hi Ferran,

On Thursday 27 May 2010 12:40:13, Ferran Jorba wrote:
> > On Thursday 27 May 2010 11:01:45, Ferran Jorba wrote:
> >> a few weeks ago, I found out that html2text is inadequate to create
> >> plain text from HTML, because it only knows about iso-8859-1. I suggest
> >> lynx instead, as I explained in this mail (attached fragment).
> >>
> >> As I haven't seen any news about this, maybe I should create a ticket
> >> for it. But I don't know where, because I'm not up to date about
> >> your trac migration.
> >>
> >> Should I do it myself? Your guidance will be appreciated.
> >
> > actually in the current master, the text extraction and other file
> > conversions have been reimplemented from scratch via the new websubmit
> > converter tool library.
>
> And if this document does not come in via websubmit? How are they
> handled? In our workflow, where we (try to) collect from many different
> ways, websubmit is just one of them, and for html documents, never used.
> We use a home made web based wget wrapper.
>
> In 0.99.1 this conversion is part of bibindex. Maybe are you expecting
> us to create the .txt file alongside the original one?
Not at all :-) Although the library resides in WebSubmit, it is the new
central place that the whole of Invenio will use to perform conversions.
WebSubmit is one of the main users (as conversions are typically part of
a submission workflow), but BibIndex in Git already uses the library for
exactly this purpose: extracting text. (For a given bibdoc, there is even
a heuristic and a config variable to choose which of the available
formats the text should be extracted from.) Moreover, the extracted text
is stored on the filesystem in a versioned way, so that it is extracted
only once and can later be reused by BibIndex, BibClassify, Refextract,
... (BibIndex will also use the library to extract text from remote URLs,
in case you are not storing all the documents locally in bibdocs.)

> I've just taken a look at the source. I guess it is
> http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=blob;f=modules/websubmit/lib/websubmit_file_converter.py;h=ab2415e7610eda852ed61864ce537de1fdbaae9a;hb=HEAD#l771
>
> It looks simple and probably right. I'll try to isolate it and test it
> myself as standalone tool.

Great! Let us know about your results, so that we can improve it if
needed.

Best regards,
	Samuele

-- 
Samuele Kaplun ** CERN Document Server ** <http://cds.cern.ch/>
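
For readers of the thread, the "choose a format, extract once, reuse
later" behaviour described in the reply can be sketched roughly as below.
This is only an illustration under stated assumptions, not the actual
Invenio code: the function names, the preference list and the cache file
naming are all hypothetical, and the real heuristic and config variable
are not shown in this thread.

```python
import os

# Hypothetical preference order (an assumption for this sketch only).
PREFERRED_FORMATS = [".txt", ".html", ".pdf", ".doc"]


def choose_source_format(available_formats, preferred=PREFERRED_FORMATS):
    """Return the first preferred format that is available, else None."""
    for fmt in preferred:
        if fmt in available_formats:
            return fmt
    return None


def cached_text_path(cache_dir, doc_id, version):
    """Build a per-version cache file name (hypothetical layout)."""
    return os.path.join(cache_dir, "%s_v%s.txt" % (doc_id, version))


def get_text(cache_dir, doc_id, version, extractor):
    """Return the text for (doc_id, version), extracting it at most once.

    `extractor` is any zero-argument callable returning the extracted
    text; it is only called when no cached file exists yet.
    """
    path = cached_text_path(cache_dir, doc_id, version)
    if not os.path.exists(path):
        text = extractor()
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

With a layout like this, any later consumer (an indexer, a classifier, a
reference extractor) would call `get_text` with the same document id and
version and read the stored file instead of running the conversion again;
a new version of the document gets its own cache file.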
