Re: Best strategy to recreate fulltext index?

Ferran Jorba Thu, 27 Jun 2013 06:45:31 -0700

Hi Tibor,

Tibor Simko <[email protected]> wrote:
> 
> On Thu, 27 Jun 2013, [email protected] wrote:
>> My question is: how can I «check text extraction procedures» as you
>> have recommended me?
>
> You can run your version of pdftotext on various files and observe
> success rate and compare to the silent `.text' files that Invenio
> uses.
>
> Basically, the stemming should be always on for the fulltext index.
> You can install PyStemmer and switch it on in the BibIndex Admin UI
> and then reindex.  Having stemming on will help in reducing the number
> of terms considerably.


Of course, now I understand!  To tell you the truth, now that PyStemmer
is already packaged for Debian
(http://packages.debian.org/python-stemmer), I had started to work on
it, but I had several doubts.  On DDD, I didn't know which language to
choose: although I have no precise statistics, our content is probably
split in roughly equal thirds among Catalan, Spanish and English, plus
some other languages.  For Traces, where Catalan is the clear choice,
there are Catalan rules upstream
(http://snowball.tartarus.org/algorithms/catalan/stemmer.html), but
they are not yet packaged for Debian.  I tried to backport them, but I
failed.

We'll have to choose something in the meantime...
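
One rough way to compare candidate languages might be to count how many
distinct terms a sample of our text yields with each stemmer.  This is
only a sketch, not something Invenio does itself: the helper names
distinct_terms and make_snowball_stemmer are my own, and the word
splitting is a simple regular expression rather than BibIndex's real
tokenizer.

```python
import re


def distinct_terms(text, stem=None):
    """Return the set of distinct lowercase terms in text.

    stem is an optional function mapping one word to its stem.
    The \\w+ tokenization is an assumption made for this sketch.
    """
    words = re.findall(r"\w+", text.lower())
    if stem is not None:
        words = [stem(w) for w in words]
    return set(words)


def make_snowball_stemmer(language):
    """Build a word -> stem function using PyStemmer.

    Imported lazily so that distinct_terms() works without PyStemmer.
    Stemmer.algorithms() lists the languages the installed build has.
    """
    import Stemmer
    return Stemmer.Stemmer(language).stemWord
```

For example, len(distinct_terms(sample, make_snowball_stemmer("spanish")))
against len(distinct_terms(sample)) would show how much a given stemmer
shrinks the term list for that sample.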

I'll also take a look at pdftotext output.
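
For that check, I imagine something along these lines: run pdftotext on
a batch of files and count how many give plausible text.  It is a
sketch under assumptions: the function names and the thresholds
(min_words, min_alpha_ratio) are invented for illustration, and the
heuristic is only a crude stand-in for actually reading the output.

```python
import subprocess


def run_pdftotext(pdf_path):
    """Return the text that pdftotext extracts from pdf_path.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def looks_extracted(text, min_words=20, min_alpha_ratio=0.6):
    """Heuristic: enough words, and most non-space characters are letters.

    Both thresholds are arbitrary assumptions, to be tuned by hand.
    """
    chars = [c for c in text if not c.isspace()]
    if len(text.split()) < min_words or not chars:
        return False
    alpha = sum(1 for c in chars if c.isalpha())
    return alpha / len(chars) >= min_alpha_ratio


def success_rate(texts):
    """Fraction of the given extracted texts that pass the heuristic."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(looks_extracted(t) for t in texts) / len(texts)
```

The same looks_extracted() test could be applied to the `.text' files
that Invenio already keeps, to compare both results.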

Thanks again,

Ferran
