Hi Tibor,

Tibor Simko <[email protected]> wrote:
> On Thu, 27 Jun 2013, [email protected] wrote:
>> My question is: how can I «check text extraction procedures» as you
>> have recommended me?
>
> You can run your version of pdftotext on various files and observe
> success rate and compare to the silent `.text' files that Invenio
> uses.
>
> Basically, the stemming should be always on for the fulltext index.
> You can install PyStemmer and switch it on in the BibIndex Admin UI
> and then reindex.  Having stemming on will help in reducing the number
> of terms considerably.

Of course, now I understand!

To tell you the truth, now that PyStemmer is already debianized
(http://packages.debian.org/python-stemmer), I started to work on it,
but I ran into several doubts.

On DDD, I didn't know which language to choose: although I have no
precise statistics, the content is perhaps about one third each of
Catalan, Spanish and English, plus some other languages.

For Traces, where Catalan is the clear choice, there are Catalan rules
upstream (http://snowball.tartarus.org/algorithms/catalan/stemmer.html),
but they are not yet packaged for Debian.  I tried to backport them, but
I failed.  We'll have to choose something else in the meantime...

I'll also take a look at the pdftotext output.

Thanks again,

Ferran
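
P.S. A minimal sketch of how one might check, outside Invenio, whether
the installed PyStemmer build offers a given language and how much
stemming shrinks the number of distinct terms. It does not touch
BibIndex at all. The helper names (count_terms, make_stemmer) and the
sample sentence are only illustrative assumptions; the PyStemmer calls
used are Stemmer.algorithms(), Stemmer.Stemmer() and stemWord().

```python
import re


def count_terms(text, stem_word=None):
    """Count distinct lowercase word tokens, optionally after stemming.

    stem_word is any callable mapping a word to its stem (hypothetical
    helper argument; pass PyStemmer's stemWord or leave it as None).
    """
    words = re.findall(r"\w+", text.lower())
    if stem_word is not None:
        words = [stem_word(w) for w in words]
    return len(set(words))


def make_stemmer(language):
    """Return PyStemmer's stemWord for language, or None if unavailable.

    None means either PyStemmer is not installed or the installed build
    has no rules for that language (e.g. a build without 'catalan').
    """
    try:
        import Stemmer  # PyStemmer; Debian package python-stemmer
    except ImportError:
        return None
    if language not in Stemmer.algorithms():
        return None
    return Stemmer.Stemmer(language).stemWord


if __name__ == "__main__":
    # Illustrative sample only, not taken from DDD or Traces.
    sample = "index indexes indexing indexed search searches searching"
    print("distinct terms without stemming:", count_terms(sample))
    for language in ("english", "spanish", "catalan"):
        stem = make_stemmer(language)
        if stem is None:
            print(language, "stemmer not available here")
        else:
            print(language, "distinct terms:", count_terms(sample, stem))
```

Running it on a few extracted `.text' files per collection, instead of
the sample sentence, would give a rough idea of which language's rules
reduce the vocabulary most while the Catalan rules are still missing.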

