Title: bibindex on fulltext ending with error <urlopen error (4, 'Interrupted system call')>

Hi,

I am encountering problems while trying to index fulltext files in invenio version 0.99.1.

Here is the situation:

The server (Ubuntu, 0.97-29ubuntu21) was running version 0.99.0 with a few local customisations
(namely, a fix for tempfiles in bibindex_engine.py, an additional set of characters added to search_engine.py's
accent stripping, and bibformat-related changes). We also added a special quote ('\’') to the separators
used by bibindex.

Back then the indexing was running without error.

After updating invenio to version 0.99.1, I decided to reindex from scratch. When running bibindex,
thousands of errors are occurring:


      Error: Cannot put word �lexique with sign 1 for recID 473.



Bibindex eventually fails with the following error:

      Error: Cannot read http://doc.rero.ch/lm.php?url="">
      <urlopen error (4, 'Interrupted system call')>

(lm.php is a PHP script which manages access to the fulltext files. Some fulltext files are stored internally
in invenio, under /opt/cds-invenio/var/data/files/, and some are stored separately and linked to using lm.php)

I assume the two error messages are somehow linked: for example, the bibindex process might be
stopped after too many errors, which would explain the 'Interrupted system call' error.
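
For reference, error number 4 is EINTR, which the system reports when a signal arrives while a
blocking call is in progress. Below is a minimal sketch of how such a call could be retried. It is
written for a current Python 3, and retry_on_eintr is a hypothetical helper of mine, not something
that exists in invenio. With urlopen the errno is wrapped inside the URLError, so it would need
unwrapping first.

```python
import errno

def retry_on_eintr(func, max_tries=5):
    """Call func(), retrying when it fails with EINTR (errno 4 on Linux)."""
    for attempt in range(max_tries):
        try:
            return func()
        except OSError as e:
            # Anything other than an interrupted system call is a real error;
            # also give up once the last allowed attempt has failed.
            if e.errno != errno.EINTR or attempt == max_tries - 1:
                raise
```

This would only hide the symptom; it does not tell us which signal interrupts the call.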

After reverting to version 0.99.0, the same errors (Cannot put word ...) are occurring, so imo
the problem must lie elsewhere (I switched to 0.99.1 again afterwards).
Between the last full indexing in 0.99.0 and when I first noticed the
errors happening in 0.99.1, several things were installed or changed on the server:

 -installed symfony php framework (http://www.symfony-project.org)
 -changed apache config (added a virtual host)
 -installed a stats tool to display web access stats
 -some other things

(I am not sure whether these are relevant to the problem, but I mention them anyway, just in case.)

By looking at the errors, it seems to me that the problem lies somewhere with the encoding of the
characters (either in python or from the text extracted by the conversion tools pdftotext, pstotext, ...).

The error 'Cannot put word' occurs in bibindex_engine.py when trying to store a word, e.g.:

      self.value[word] = {recID: sign}

so the word cannot be used as a dictionary key in python.
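
I have not confirmed this, though. A quick standalone check, using Python 3 bytes as a stand-in
for the Python 2 strings bibindex works with (that substitution is my assumption), suggests that
the assignment on its own accepts such words, so the failure may come from a neighbouring step:

```python
# Byte strings are hashable, so even bytes that are not valid UTF-8
# can be used as dictionary keys.
value = {}
recID, sign = 473, 1

for word in (b'l\xe2\x80\x99ere', b'\x9clexique'):
    value[word] = {recID: sign}

print(len(value), 'words stored')
```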

When looking at the logs, the term list for most of the records contains what I assume are escaped
unicode characters, such as:

     'l\xe2\x80\x99ere', 'd\xe2\x80\x99environ', 'l\xe2\x80\x99immunofluorescence', '\x9clexique'
 
When trying to put such words in the wordtable, we get the 'cannot put word' error.
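
To see what these byte sequences actually are, here is a small check (Python 3, with describe
being a throwaway helper name of mine). It only tests whether each term is valid UTF-8; it does
not reproduce what bibindex does with the words.

```python
def describe(raw):
    """Return the decoded text if raw is valid UTF-8, else None."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None

terms = [b'l\xe2\x80\x99ere', b'd\xe2\x80\x99environ', b'\x9clexique']
results = [(raw, describe(raw)) for raw in terms]

for raw, text in results:
    print(raw, '->', repr(text) if text is not None else 'not valid UTF-8')

# Hypothetical origin of the last term (an assumption, not verified):
# U+201C, the left double quote, is \xe2\x80\x9c in UTF-8, so a lone \x9c
# could be what remains after the first two bytes were split off.
possible_origin = b'\xe2\x80\x9clexique'.decode('utf-8')
```

The first two terms decode to words containing U+2019 (the right single quote), while
'\x9clexique' is not valid UTF-8 on its own.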


I am quite puzzled as to the source of the problem (conversion tools, system configuration, ...?),
and I'd really like to fix this so the server can be used in production. Any hints?


Best regards
