Hi Theodoros:

On Wed, 02 Dec 2009, Theodoropoulos Theodoros wrote:

> I was wondering if there is a config option (something similar to
> CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY) to completely disable
> fulltext word extraction and indexing...

Just delete the full-text index in the BibIndex Admin page, or in CLI:

   $ echo "DELETE FROM idxINDEX WHERE name='fulltext'" | \
       /opt/cds-invenio/bin/dbexec

You may also want to deactivate the `fulltext' search field in the
WebSearch Admin interface to prevent this field from being shown to
the users as a search option.

If you want, you can also `pause' fulltext indexing by moving the
index's last-updated timestamp far into the future:

   $ echo "UPDATE idxINDEX SET last_updated='2100-01-01' WHERE name='fulltext'" | \
       /opt/cds-invenio/bin/dbexec

and resume it manually later (e.g. when the encoding troubles are
solved).

> It seems to cause a lot of troubles in our installation mainly because
> some users (despite my warnings) keep uploading files with GREEK
> filenames, which freaks out the pdftotext and consequently the
> bibindex task [1].
>
> [1] The bibsched task error log gives errors similar to these:
> 2009-12-02 14:52:47 --> Error while running /usr/bin/pdftotext -enc
> UTF-8 /opt/cds-invenio/var/data/files/g20/100543/ΚΑΛΕΑ ΒΑΣΙΛΙΚΗ.pdf;1
> /opt/cds-invenio/var/tmp/tmpijx8viinvenio.tmp.txt for
> http://invenio.lib.auth.gr/record/113745/files/%CE%9A%CE%91%CE%9B%CE%95%CE%91%20%CE%92%CE%91%CE%A3%CE%99%CE%9B%CE%99%CE%9A%CE%97.pdf.
> GPL Ghostscript 8.62: Unrecoverable error, exit code 1
> GPL Ghostscript 8.62: Unrecoverable error, exit code 1
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning:
> Incorrect string value: '\xF0\x9D\x90\xB4' for column 'term' at row 1
>   rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning:
> Incorrect string value: '\xF0\x9D\x90\xB9' for column 'term' at row 1
>   rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning:
> Incorrect string value: '\xF0\x9D\x90\xBB' for column 'term' at row 1
>   rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning:
> Incorrect string value: '\xF0\x9D\x90\x95' for column 'term' at row 1
>   rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning:
> Incorrect string value: '\xF0\x9D\x9C\x86\xF0\x9D...' for column 'term'
> at row 1
>   rc = cur.execute(sql, param)
> [...]
> And several registered exceptions are thrown as well that look like:
> Error when putting the term ''\xf0\x9d\x90\xbe\xf0\x9d\x91\x89'' into db
> (hitlist=intbitset([113470])): (1062, "Duplicate entry '' for key 2")
> [...]

We should not be generating such errors.  What version of pdftotext
(xpdf, poppler) are you running?  Is its output UTF-8 perfect?  Is
your DB running in nice UTF-8 mode?  Can we get the test file to check
if our dev branch behaves fine?

(CC-ing project-cdsware-developers)

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
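
P.S. One rough way to approach the "is the output UTF-8 perfect?" question is to decode the byte strings from the quoted warnings and see which characters they represent. The sketch below is only an illustration, not part of Invenio: it assumes Python 3 (the installation in the log runs Python 2.4), and the names `REJECTED_TERMS`, `non_bmp_chars` and `report` are made up for this example.

```python
import unicodedata

# Byte strings copied verbatim from the "Incorrect string value" warnings.
REJECTED_TERMS = [
    b"\xF0\x9D\x90\xB4",
    b"\xF0\x9D\x90\xB9",
    b"\xF0\x9D\x90\xBB",
    b"\xF0\x9D\x90\x95",
]


def non_bmp_chars(text):
    """Return the characters of `text` that lie above U+FFFF.

    Such characters need four bytes in UTF-8.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


def report(raw):
    """Decode `raw` as UTF-8 and describe each resulting character."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return [
        "U+%04X %s (%d bytes)"
        % (ord(ch), unicodedata.name(ch, "<unnamed>"), len(ch.encode("utf-8")))
        for ch in text
    ]


if __name__ == "__main__":
    for raw in REJECTED_TERMS:
        print(raw, report(raw))
    # Hypothetical comparison: the Greek filename from the log.
    print(non_bmp_chars("ΚΑΛΕΑ ΒΑΣΙΛΙΚΗ"))
```

If the sequences decode cleanly, the extracted text is valid UTF-8 and the rejected terms are four-byte characters rather than the Greek letters of the filename. In that case, and assuming the `term` column uses MySQL's `utf8` character set (which stores at most three bytes per character; the thread does not say which character set is in use), the warnings would point at the database side rather than at pdftotext.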
