On Thu, 27 Jun 2013, [email protected] wrote:
> My question is: how can I «check text extraction procedures» as you have
> recommended me?

You can run your version of pdftotext on a variety of files, observe the
success rate, and compare the output to the silent `.text' files that
Invenio uses.
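
A rough way to measure this could look like the sketch below.  It is
only an illustration, not something that ships with Invenio: it assumes
pdftotext is on the PATH, the helper names are made up, and the 1%
threshold is an arbitrary value you would want to tune on your own files.

```python
import subprocess
import unicodedata


def control_char_ratio(text):
    """Fraction of characters that are control characters (category Cc),
    not counting newlines, tabs and the form feeds pdftotext puts
    between pages."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text
              if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t\f")
    return bad / len(text)


def extract_text(pdf_path):
    """Run pdftotext on one file; '-' sends the text to stdout."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def success_rate(pdf_paths, max_ratio=0.01):
    """Fraction of PDFs whose extracted text is non-empty and stays
    under max_ratio (an illustrative threshold, not an Invenio setting)."""
    if not pdf_paths:
        return 0.0
    ok = 0
    for path in pdf_paths:
        try:
            text = extract_text(path)
        except (subprocess.CalledProcessError, OSError):
            continue  # a failed extraction counts as not OK
        if text.strip() and control_char_ratio(text) <= max_ratio:
            ok += 1
    return ok / len(pdf_paths)
```

The same control_char_ratio() helper can be pointed at the contents of
an existing `.text' file to see how it compares.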

Basically, stemming should always be on for the fulltext index.  You
can install PyStemmer, switch it on in the BibIndex Admin UI, and then
reindex.  Having stemming on will reduce the number of terms
considerably.

Otherwise we may want to be more aggressive about the gibberish that
occurs from time to time in the pdftotext output, i.e. analyse the .text
files and throw away lines that look like gibberish based on encoding,
spell checking, etc.  This would have to be coded.
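
As a starting point, such a filter might look like the sketch below.
This is not existing Invenio code: the function names and the 0.5
threshold are assumptions for illustration, and it only covers the
encoding side (control characters, replacement characters, share of
letters and digits), not spell checking.

```python
import unicodedata


def looks_like_gibberish(line, min_alnum_ratio=0.5):
    """Heuristic: flag a line that contains control characters or
    Unicode replacement characters, or that is mostly symbols."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines, they separate paragraphs
    if any(unicodedata.category(ch) == "Cc" and ch != "\t"
           for ch in stripped):
        return True
    if "\ufffd" in stripped:
        return True  # decoding already failed somewhere on this line
    meaningful = sum(1 for ch in stripped if ch.isalnum() or ch.isspace())
    return meaningful / len(stripped) < min_alnum_ratio


def clean_text(text):
    """Drop the lines of a .text file that look like gibberish."""
    kept = [line for line in text.split("\n")
            if not looks_like_gibberish(line)]
    return "\n".join(kept)
```

The threshold would have to be checked against real files, since
tables, formulas and reference lists can be legitimately symbol-heavy.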

Also, there are other tools available for extracting text from PDF,
such as Apache PDFBox.  They may be more suitable for your files than
pdftotext.  Analysing a sample of the PDF documents that currently
produce gibberish for you would tell.

> Reindexing using those parameters should be enough to remove my bogus
> entries?

Yes, it should be OK.

> I've checked and the other indexes don't have control characters.

These usually appear when pdftotext cannot do a good job of extracting
text, for one reason or another.  So usually only the fulltext index is
affected.

Best regards
--
Tibor Simko
