If you want, you can also `pause' fulltext indexing by moving its
last-updated timestamp far into the future:
$ echo "UPDATE idxINDEX SET last_updated='2100-01-01' WHERE name='fulltext'" | \
    /opt/cds-invenio/bin/dbexec
and resume it manually later (e.g. when the encoding troubles are solved).
Super! That's _exactly_ what I was looking for...
We should not be generating such errors. What version of pdftotext
(xpdf, poppler) are you running? Is its output UTF-8 perfect? Is your
DB running in nice UTF-8 mode? Can we get the test file to check if our
dev branch behaves fine?
I was running poppler 0.6.x, but after your reply, I realized that
several new (stable) versions have been released for that package, so
I updated to 0.10.5... I'm still getting the same errors.
The output of "pdftotext -enc UTF-8 input.pdf output.txt" is not
perfect: some words in the exported text file are split the wrong way,
probably because non-Latin 2-byte characters are not taken into
consideration (but this is not your fault :). Having said that, adding
the "-layout" switch solves the problem. Oh, and I should probably
mention that some of our PDF docs are simply JPG images converted to
PDF. Running pdftotext on these would probably create a lot of
garbage...
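One way to triage the extracted text before it reaches the indexer is to check that it decodes as UTF-8 and that most of it looks like words rather than noise. The sketch below is only an illustration under that assumption: `text_quality` is a hypothetical helper, not part of Invenio or poppler, and any threshold applied to its result would have to be tuned on real documents.

```python
import unicodedata


def text_quality(raw: bytes) -> dict:
    """Summarise whether extracted text is valid UTF-8 and looks like words.

    Hypothetical helper for triage only; not an Invenio or poppler API.
    """
    try:
        text = raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        # Keep going so the ratio can still be computed; invalid bytes
        # become U+FFFD, which is not counted as a letter or digit.
        text = raw.decode("utf-8", errors="replace")
        valid = False

    visible = [ch for ch in text if not ch.isspace()]
    # Letters and digits of any script (Greek included) count as "wordy".
    wordy = sum(1 for ch in visible if unicodedata.category(ch)[0] in ("L", "N"))
    ratio = wordy / len(visible) if visible else 0.0
    return {"valid_utf8": valid, "chars": len(visible), "wordy_ratio": ratio}
```

Feeding it the bytes of output.txt would show, for example, an image-only PDF as a file with very few visible characters, and a badly encoded one as `valid_utf8` being false.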
mysql should be ok as far as charset/collation is concerned:
character set client       utf8
character set connection   utf8
character set database     utf8
character set filesystem   binary
character set results      utf8
character set server       utf8
character set system       utf8
collation connection       utf8_unicode_ci
  (Global value)           utf8_general_ci
collation database         utf8_general_ci
collation server           utf8_general_ci
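A listing like this can also be checked mechanically. The following is a minimal sketch, assuming the variables have already been collected into a dict; `non_utf8_settings` is a hypothetical name, and the only exception it knows about is the filesystem character set, which is `binary` by default in MySQL.

```python
def non_utf8_settings(variables: dict) -> dict:
    """Return the charset/collation variables whose value is not utf8-based.

    Hypothetical helper; variable names may use spaces or underscores.
    """
    # character_set_filesystem is expected to be "binary", so it is skipped.
    expected_binary = {"character_set_filesystem"}
    suspicious = {}
    for name, value in variables.items():
        key = name.replace(" ", "_")
        if key in expected_binary:
            continue
        # Accept utf8, utf8mb4 and any utf8_* collation.
        if not value.startswith("utf8"):
            suspicious[name] = value
    return suspicious
```

With the values listed above it returns an empty dict; a stray `latin1` connection or server setting would show up in the result.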
btw, you are more than welcome to use the fulltext in order to perform
any test you wish!
Just for the record, I'm using the Upload_Files.py websubmit
function, so up to now I couldn't take advantage of the template
(*.tpl) files to insert 8564_u into MARC (but even without it, Invenio
is smart enough to figure out the related fulltext files). Having said
that, I was recently asked to put the fulltext links in the search
page as well, so I had to run bibdocfile --fix-marc for some
collections. Several 856s were created, and after the scheduled
bibindex run I began to get the registered exceptions.
I'm not sure, but the exception _seems_ to be thrown only for filenames
that contain spaces and/or Greek characters... I'll be happy to give
you any additional info/fulltext files/logs/etc. you may need...
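To narrow this down, the attached filenames could be split into "plain" and "suspect" ones and the two groups compared against the exception log. This is only a sketch of that idea: `risky_filenames` is a hypothetical helper, and the list of names is assumed to come from wherever the fulltext files are stored.

```python
def risky_filenames(names):
    """Return the names that contain whitespace or non-ASCII characters."""
    risky = []
    for name in names:
        has_space = any(ch.isspace() for ch in name)
        has_non_ascii = any(ord(ch) > 127 for ch in name)
        if has_space or has_non_ascii:
            risky.append(name)
    return risky
```

If every record that raises the exception has a file in the returned list, and no record outside it does, that would support the spaces/Greek-characters theory.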
Best regards,
Theodoropoulos Theodoros
PS. Ghostscript is also complaining ("GPL Ghostscript
8.62: Unrecoverable error, exit code 1"). Should that worry me?