Hi Theodoros:

On Wed, 02 Dec 2009, Theodoropoulos Theodoros wrote:
> I was wondering if there is a config option (something similar to
> CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY) to completely disable
> fulltext word extraction and indexing...

Just delete the full-text index in the BibIndex Admin page, or from the CLI:

 $ echo "DELETE FROM idxINDEX WHERE name='fulltext'" | \
   /opt/cds-invenio/bin/dbexec

You may also want to deactivate the `fulltext' search field in the
WebSearch Admin interface so that it is no longer offered to users as a
search option.

If you want, you can also `pause' fulltext indexing by setting the
index's last-updated time far into the future:

 $ echo "UPDATE idxINDEX SET last_updated='2100-01-01' WHERE name='fulltext'" | \
   /opt/cds-invenio/bin/dbexec

and resume it manually later (e.g. when the encoding troubles are solved).
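
To make the pause/resume idea concrete, here is a minimal Python 3 sketch
that only builds the two SQL statements; it is not part of Invenio. The
helper name `set_last_updated_sql` is hypothetical, and resuming by moving
`last_updated` back to the moment indexing was paused is an assumption, as
is the example resume date.

```python
import datetime


def set_last_updated_sql(index_name, when):
    """Build the UPDATE statement that moves an index's last_updated timestamp.

    Hypothetical helper for illustration; it only returns a string and
    does not talk to the database.
    """
    if not index_name.isalnum():
        raise ValueError("unexpected index name: %r" % index_name)
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return "UPDATE idxINDEX SET last_updated='%s' WHERE name='%s'" % (
        stamp, index_name)


# Pause: the same far-future date as in the command above.
pause = set_last_updated_sql("fulltext", datetime.datetime(2100, 1, 1))

# Resume (assumption): move the timestamp back to when indexing was paused.
# The date used here is only an example.
resume = set_last_updated_sql("fulltext", datetime.datetime(2009, 12, 2))

print(pause)
print(resume)
```

Either statement could then be piped to dbexec in the same way as the
commands above.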

> It seems to cause a lot of troubles in our installation mainly because
> some users (despite my warnings) keep uploading files with GREEK
> filenames, which freaks out the pdftotext and consequently the
> bibindex task [1].
>
> [1] The bibsched task error log gives errors similar to these:
> 2009-12-02 14:52:47 --> Error while running /usr/bin/pdftotext -enc 
> UTF-8 /opt/cds-invenio/var/data/files/g20/100543/ΚΑΛΕΑ ΒΑΣΙΛΙΚΗ.pdf;1 
> /opt/cds-invenio/var/tmp/tmpijx8viinvenio.tmp.txt for 
> http://invenio.lib.auth.gr/record/113745/files/%CE%9A%CE%91%CE%9B%CE%95%CE%91%20%CE%92%CE%91%CE%A3%CE%99%CE%9B%CE%99%CE%9A%CE%97.pdf.
> GPL Ghostscript 8.62: Unrecoverable error, exit code 1
> GPL Ghostscript 8.62: Unrecoverable error, exit code 1
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning: 
> Incorrect string value: '\xF0\x9D\x90\xB4' for column 'term' at row 1
>    rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning: 
> Incorrect string value: '\xF0\x9D\x90\xB9' for column 'term' at row 1
>    rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning: 
> Incorrect string value: '\xF0\x9D\x90\xBB' for column 'term' at row 1
>    rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning: 
> Incorrect string value: '\xF0\x9D\x90\x95' for column 'term' at row 1
>    rc = cur.execute(sql, param)
> /usr/lib/python2.4/site-packages/invenio/dbquery.py:228: Warning: 
> Incorrect string value: '\xF0\x9D\x9C\x86\xF0\x9D...' for column 'term' 
> at row 1
>    rc = cur.execute(sql, param)
> [...]
> And several registered exceptions are thrown as well that look like:
> Error when putting the term ''\xf0\x9d\x90\xbe\xf0\x9d\x91\x89'' into db 
> (hitlist=intbitset([113470])): (1062, "Duplicate entry '' for key 2")
> [...]

We should not be generating such errors.  What version of pdftotext
(xpdf, poppler) are you running?  Is its output valid UTF-8?  Is your
DB running in UTF-8 mode?  Could you send us the test file so that we
can check whether our dev branch behaves correctly?
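
As a quick check on the encoding question, the byte sequences from the
quoted warnings can be decoded directly. The following is a minimal
Python 3 sketch, not part of Invenio; the `describe` helper is
hypothetical. It suggests that each sequence is a valid four-byte UTF-8
character outside the Basic Multilingual Plane, which MySQL's `utf8`
character set (at most three bytes per character) cannot store.

```python
import unicodedata

# Byte sequences copied from the "Incorrect string value" warnings
# quoted above.
SAMPLES = [
    b"\xF0\x9D\x90\xB4",
    b"\xF0\x9D\x90\xB9",
    b"\xF0\x9D\x90\xBB",
    b"\xF0\x9D\x90\x95",
]


def describe(raw):
    """Return (code point, UTF-8 byte length, Unicode name) for one character.

    Hypothetical helper for illustration; raises UnicodeDecodeError if
    the bytes are not valid UTF-8.
    """
    char = raw.decode("utf-8")
    return "U+%04X" % ord(char), len(raw), unicodedata.name(char, "UNKNOWN")


for raw in SAMPLES:
    print(describe(raw))
```

If the bytes decode cleanly, as they do here, the pdftotext output is
probably fine and the database character set is the more likely suspect.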

(CC-ing project-cdsware-developers)

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
