Hi Dominic:
On Wed, 06 Aug 2008, Dominic Lukas Wyler wrote:
> When looking at the logs, the term list for most of the records
> contains what I assume are escaped unicode characters, such as:
>
> 'l\xe2\x80\x99ere', 'd\xe2\x80\x99environ',
> 'l\xe2\x80\x99immunofluorescence', '\x9clexique'
Not all of these are valid UTF-8: '\x9clexique' is a good example,
since '\x9c' by itself is not a legal UTF-8 sequence (it looks rather
like a Windows-1252 byte). The '\xe2\x80\x99' sequences, on the other
hand, are the correct UTF-8 encoding of the right single quotation
mark (U+2019).
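For illustration, a quick way to check which of these byte strings are
valid UTF-8 is to try decoding them from a Python prompt (a minimal
Python 2 sketch; treating '\x9c' as a Windows-1252 byte is only my
guess):

    # Try to decode each suspicious term as UTF-8:
    for term in ['l\xe2\x80\x99ere', 'd\xe2\x80\x99environ', '\x9clexique']:
        try:
            print repr(term), '->', repr(term.decode('utf-8'))
        except UnicodeDecodeError, err:
            print repr(term), '-> not valid UTF-8:', err
            # '\x9c' happens to be the 'oe' ligature in cp1252
            # (Windows-1252), which hints that this text was never
            # UTF-8 to begin with:
            print '   as cp1252:', repr(term.decode('cp1252'))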
Are these terms coming up when indexing the metadata or the full-text?
If from a full-text file, is it PDF or some other format? E.g. if it
is PDF, then pdftotext should produce quite clean UTF-8 output (when
run with '-enc UTF-8').
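If in doubt, a hedged sketch of such an extraction step (the file
names are placeholders of mine) would be:

    import subprocess

    # Ask pdftotext explicitly for UTF-8 output:
    subprocess.check_call(['pdftotext', '-enc', 'UTF-8',
                           'fulltext.pdf', 'fulltext.txt'])

    # Sanity check: the result should decode as UTF-8 without errors:
    open('fulltext.txt').read().decode('utf-8')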
Can you send me your input file for testing? BTW, running bibindex
with verbosity level 9 may be helpful to see what is going on.
> an additional set of characters added to search_engine.py's accent
> stripping and bibformat related changes). We also added a special
> quote ('\'') to the separators used by bibindex.
The problem may also be related to your changes to the accent-stripping
and word-breaking procedures, if those changes happen not to be fully
UTF-8 safe, e.g. if they break a word in the middle of a multi-byte
UTF-8 character. (This is why strip_accents() temporarily converts
its UTF-8 binary string input into a Unicode string before applying
the regexps.) You may want to check your edits from this point of
view.
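To make the point concrete, here is a minimal sketch of a UTF-8-safe
accent-stripping function in the spirit of strip_accents() (not the
actual Invenio code; the NFKD-based stripping is my assumption):

    import unicodedata

    def strip_accents_safe(text):
        # TEXT is a UTF-8 binary string; decode it first, so that no
        # multi-byte character can be cut in the middle:
        utext = text.decode('utf-8')
        # Decompose accented letters into base letter plus combining
        # mark (NFKD), then drop the combining marks:
        decomposed = unicodedata.normalize('NFKD', utext)
        stripped = u''.join(c for c in decomposed
                            if not unicodedata.combining(c))
        return stripped.encode('utf-8')

    # E.g. strip_accents_safe('\xc3\xa9t\xc3\xa9') gives 'ete', while
    # a byte-level regexp could cut '\xc3\xa9' in half and leave
    # garbage behind.

The same decode-first approach applies to the word-breaking step: once
the string is Unicode, one can safely split on both the ASCII quote
and the typographic one (u'\u2019') without risking to cut a character.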
Just wild-guessing what else may have gone wrong if it is not the first
possibility ('badly asciified full-text file')...
Best regards
--
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>