CVS Commit Overview for 2006-08-09
==================================

2006-08-09  Nicholas Robinson <[email protected]>

        * modules/bibindex/lib/bibindex_engine.py: Fixed a bug in the
        indexing of fulltexts. When the contents of a PDF fulltext are
        to be indexed, the tool "pdftotext" is used to convert the PDF
        to plain text. The plain text should be UTF-8 so that
        search_engine (strip_accents) can replace accented letters with
        their non-accented counterparts.  However, pdftotext outputs
        Latin-1 by default, so the accented letters were not replaced
        and were stored as-is in the fulltext word index. As a result,
        searching for an accented word within a fulltext returned no
        results unless the non-accented form of that word also occurred
        in the document. [E.g. searching for "später" would only return
        documents containing "spater", because search engine strips the
        accent from the search query, so the query can never match the
        accented word in the fulltext word index.]  The problem was
        fixed by calling pdftotext with its "-enc UTF-8" argument.
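
The effect described in this entry can be illustrated with a minimal sketch. It is not the code from bibindex_engine.py: the helper names (build_pdftotext_command, strip_accents_sketch, normalize_word) are hypothetical, and the accent stripping is a stand-in based on Unicode decomposition rather than the actual strip_accents of search_engine. Only the "-enc UTF-8" option is taken from the commit message.

```python
import unicodedata


def build_pdftotext_command(pdf_path, txt_path):
    """Return an argument list for pdftotext (hypothetical helper).

    "-enc UTF-8" is the option named in the commit message; without
    it, pdftotext falls back to Latin-1 output.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def strip_accents_sketch(text):
    """Assumed stand-in for strip_accents: drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_word(raw_bytes):
    """Treat raw_bytes as UTF-8 (as the indexer expects) and strip accents.

    Bytes that are not valid UTF-8 become U+FFFD, which the accent
    stripper cannot map back to a plain letter.
    """
    text = raw_bytes.decode("utf-8", errors="replace")
    return strip_accents_sketch(text)


# The same word as emitted with and without "-enc UTF-8".
latin1_word = "später".encode("latin-1")
utf8_word = "später".encode("utf-8")

# The search side strips the accent from the user's query.
query_term = strip_accents_sketch("später")

print(normalize_word(utf8_word) == query_term)    # UTF-8 output matches
print(normalize_word(latin1_word) == query_term)  # Latin-1 output does not
```

In this sketch the Latin-1 bytes never normalize to the stripped query term, which mirrors the missing search results described above. A list such as the one returned by build_pdftotext_command could be passed to subprocess.run on a system where pdftotext is installed.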

-- 
CDS Invenio Developers <[email protected]>

