If you want, you can also `pause' fulltext indexing by setting the
index's last-updated time far into the future:
$ echo "UPDATE idxINDEX SET last_updated='2100-01-01' WHERE name='fulltext'" | \
   /opt/cds-invenio/bin/dbexec
and resume it manually later (e.g. when the encoding troubles are solved).
Super! That's _exactly_ what I was looking for...


We should not be generating such errors.  What version of pdftotext
(xpdf, poppler) are you running?  Is its output UTF-8 perfect?  Is your
DB running in nice UTF-8 mode?  Can we get the test file to check if our
dev branch behaves fine?
I was running poppler 0.6.x, but after your reply I realized that several new (stable) versions of that package have been released, so I updated to 0.10.5. I'm still getting the same errors. The output of "pdftotext -enc UTF-8 input.pdf output.txt" is not perfect: some words in the exported text file are split in the wrong places, probably because non-Latin 2-byte characters are not taken into consideration (but this is not your fault :). Having said that, adding the "-layout" switch solves the problem. Oh, and I should probably mention that some of our PDF docs are simply JPG images converted to PDF. Running pdftotext on these will probably create a lot of garbage...
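A minimal sketch of how the pdftotext output could be checked for clean UTF-8 before it reaches the indexer. It assumes pdftotext is on the PATH and uses only the "-enc" and "-layout" switches mentioned above, plus "-" as the output name to write to stdout. The function names and the printable-ratio heuristic are illustrative assumptions, not part of Invenio.

```python
import subprocess


def pdftotext_utf8(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as raw bytes."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout instead of a file
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def utf8_report(raw):
    """Return (is_valid_utf8, printable_ratio) for extracted bytes.

    printable_ratio is a rough, hypothetical garbage indicator: the share
    of characters that are printable or whitespace.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, 0.0
    if not text:
        return True, 0.0
    ok = sum(1 for ch in text if ch.isprintable() or ch.isspace())
    return True, ok / len(text)
```

Calling utf8_report(pdftotext_utf8("input.pdf")) on a scanned, image-only PDF would typically return little or no text, so an empty result is worth treating separately from an invalid one.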

MySQL should be OK as far as charset/collation is concerned:
character set client        utf8
character set connection    utf8
character set database      utf8
character set filesystem    binary
character set results       utf8
character set server        utf8
character set system        utf8
collation connection        utf8_unicode_ci
  (Global value)            utf8_general_ci
collation database          utf8_general_ci
collation server            utf8_general_ci
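The same check can be done mechanically. The sketch below assumes the settings have already been read into a dict keyed by the underscore names MySQL itself uses (character_set_client, collation_server, and so on); the function name is hypothetical. character_set_filesystem is skipped because "binary" is its normal value.

```python
def non_utf8_settings(variables, ignore=("character_set_filesystem",)):
    """Return the charset/collation variables whose value is not utf8-based."""
    bad = {}
    for name, value in variables.items():
        if name in ignore:
            continue
        is_charset = name.startswith("character_set_")
        is_collation = name.startswith("collation_")
        if (is_charset or is_collation) and not value.startswith("utf8"):
            bad[name] = value
    return bad
```

An empty result means every relevant variable is utf8-based, which matches the listing above.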

btw, you are more than welcome to use the fulltext in order to perform any test you wish!

Just for the record, I'm using the Upload_Files.py WebSubmit function, so up to now I couldn't take advantage of the template (*.tpl) files to insert 8564_u into MARC (but even without it, Invenio is smart enough to figure out the related fulltext files). Having said that, I was recently asked to put the fulltext links on the search page as well, so I had to run bibdocfile --fix-marc for some collections. Several 856 fields were created, and after the scheduled bibindex ran, I began to get the registered exceptions.

I'm not sure, but the exception _seems_ to be thrown only for filenames that contain spaces and/or Greek characters... I'll be happy to give you any additional info/fulltext files/logs/etc. you may need...
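One way to test that suspicion is to list the fulltext files whose names contain whitespace or non-ASCII characters and compare them with the records that raise exceptions. This is only a sketch: the function names are hypothetical, and the directory to scan depends on the local installation.

```python
import os


def risky_filename(name):
    """Return the reasons a filename might be problematic (empty list if none)."""
    reasons = []
    if any(ch.isspace() for ch in name):
        reasons.append("whitespace")
    if not name.isascii():
        reasons.append("non-ascii")
    return reasons


def scan(root):
    """Yield (path, reasons) for every file under root with a risky name."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            reasons = risky_filename(name)
            if reasons:
                yield os.path.join(dirpath, name), reasons
```

If every failing record maps to a path reported by scan(), the filename hypothesis holds; a failing record with a plain ASCII name would point elsewhere.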

Best regards,
Theodoropoulos Theodoros

PS. Should the fact that Ghostscript is also complaining ("GPL Ghostscript 8.62: Unrecoverable error, exit code 1") not worry me?
