CVS Commit Overview for 2006-08-09
==================================
2006-08-09 Nicholas Robinson <[email protected]>
* modules/bibindex/lib/bibindex_engine.py: Fixed a bug in the
indexing of fulltexts. When the contents of a PDF fulltext are to
be indexed, the tool "pdftotext" is used to convert the PDF to
plain text. The plain text should be UTF-8 so that search_engine
(strip_accents) can replace accented letters with their
non-accented counterparts. However, pdftotext outputs Latin-1 by
default, so the accented letters were not replaced and were kept
in the fulltext word index. As a result, searching for a word
containing accents within a fulltext never returned any results
unless the non-accented form of that word also existed in the
document. [E.g. searching for "später" would only return results
for documents containing "spater", because the search engine
strips the accent from the search query, so the query can never
match the accented word in the fulltext word index.] The problem
was fixed by calling pdftotext with its "-enc UTF-8" argument.
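
The effect described above can be illustrated with a minimal sketch.
It is not the code from bibindex_engine.py: the function names
(build_pdftotext_command, strip_accents_sketch, index_terms) are
hypothetical, the accent stripping is a Unicode-normalization
stand-in for search_engine's strip_accents, and it is written for
Python 3. Only the pdftotext "-enc UTF-8" option comes from the
commit itself.

```python
import unicodedata


def build_pdftotext_command(pdf_path, txt_path):
    """Return the pdftotext argument list with UTF-8 output requested.

    Hypothetical helper: the real call site in bibindex_engine.py may
    build its command differently.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def strip_accents_sketch(text):
    """Stand-in for strip_accents: drop combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(raw_bytes):
    """Decode extracted text as UTF-8 and return accent-stripped words.

    Bytes that are not valid UTF-8 (e.g. Latin-1 output) become the
    replacement character, so the accented letter is never normalized.
    """
    text = raw_bytes.decode("utf-8", errors="replace")
    return {strip_accents_sketch(word).lower() for word in text.split()}


if __name__ == "__main__":
    query = strip_accents_sketch("später")
    utf8_output = "später".encode("utf-8")      # pdftotext -enc UTF-8
    latin1_output = "später".encode("latin-1")  # pdftotext default
    print(query in index_terms(utf8_output))    # query matches the index
    print(query in index_terms(latin1_output))  # query finds nothing
```

With UTF-8 input the indexed term and the stripped query agree; with
Latin-1 input the accented byte cannot be decoded, so the indexed term
never equals the query.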
--
CDS Invenio Developers <[email protected]>