On Thu, 27 Jun 2013, [email protected] wrote:
> My question is: how can I «check text extraction procedures» as you have
> recommended me?
You can run your version of pdftotext on various files, observe the success rate, and compare the output to the silent `.text' files that Invenio uses.

Basically, stemming should always be on for the fulltext index. You can install PyStemmer, switch it on in the BibIndex Admin UI, and then reindex. Having stemming on will help reduce the number of terms considerably.

Otherwise we may want to be more aggressive about the gibberish that occurs from time to time in the pdftotext output, i.e. analyse the .text files and throw away lines that look like gibberish based on encoding, spell checking, etc. This would have to be coded.

Also, there are other options available for extracting text from PDF, such as Apache PDFBox. They may be more suitable for your files than pdftotext. An analysis of a sample of the PDF documents that currently produce gibberish for you should tell.

> Reindexing using those parameters should be enough to remove my bogus
> entries?

Yes, it should be OK.

> I've checked and the other indexes don't have control characters.

These usually appear when pdftotext cannot do a good job, for one reason or another, when extracting text. So usually only the fulltext index is affected.

Best regards
--
Tibor Simko
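
P.S. To illustrate the gibberish filtering idea mentioned above, here is a minimal standalone sketch. It is not part of Invenio and does not use any Invenio API. The function names (`looks_like_gibberish`, `clean_text`) and the 0.3 threshold are hypothetical choices made only for illustration. It covers just the "encoding" side of the heuristic, namely dropping lines in which a large share of the characters are control, unassigned, or private-use characters. Spell checking is not attempted.

```python
import unicodedata

# Unicode general categories treated as "bad" (assumption for this sketch):
# Cc = control, Cn = unassigned, Co = private use.
BAD_CATEGORIES = ("Cc", "Cn", "Co")


def looks_like_gibberish(line, max_bad_ratio=0.3):
    """Return True if too many non-whitespace characters look broken.

    max_bad_ratio is a hypothetical threshold; tune it on a sample
    of your own .text files.
    """
    # Ignore whitespace (including the form feed pdftotext emits
    # between pages) so that it does not count against the line.
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return False  # blank lines carry no index terms; keep them
    bad = sum(
        1
        for ch in chars
        if unicodedata.category(ch) in BAD_CATEGORIES or ch == "\ufffd"
    )
    return bad / len(chars) > max_bad_ratio


def clean_text(text, max_bad_ratio=0.3):
    """Drop the lines of `text` that look like gibberish."""
    kept = [
        line
        for line in text.split("\n")
        if not looks_like_gibberish(line, max_bad_ratio)
    ]
    return "\n".join(kept)
```

One would run something like `clean_text()` over the contents of each .text file before indexing, after checking on a sample that the threshold does not throw away legitimate lines.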

