On Tue, 01 Jul 2014, Miguel Martín wrote:
> Why is this happening? Why do I get search results with term
> "ingeniería" and not "emérito"? Is this error related to accents?

The first thing to check is whether the extraction of text from your
PDF source file worked well. Just find where the document lives, for
example which document IDs are attached to record 13793:

    $ bibdocfile --get-info --recid 13793

which will show you the document file location, like:

    /opt/invenio/var/data/files/g0/78/

then look at the extracted text file:

    /opt/invenio/var/data/files/g0/78/.text\;1

to see what the extracted text contains. The extraction typically
uses this procedure:

    pdftotext -q -enc UTF-8

which may work with more or less success, depending on your source
PDF file and on your pdftotext version.

Secondly, if you find both words "ingeniería" and "emérito" present
in that text file, as independent words in proper UTF-8 encoding, then
this would indicate that the extraction phase worked well and that the
problem may be related to accent treatment, as you suspect.
(Otherwise all Latin-1 accents are treated the same, so if you find
differences between the two words, you can look for things like word
breaks, unusual leading/trailing quotes instead of spaces, etc. Or
simply look for another word in that PDF containing "é" to see
whether it is findable.)

Can you please check and report back?

Best regards
-- 
Tibor Simko
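
If inspecting the extracted text by eye is inconclusive, the second check can be scripted. What follows is only a minimal sketch, assuming Python 3; `diagnose_word` and `check_file` are hypothetical helper names made up for illustration and are not part of Invenio. It reports whether a word occurs as a standalone word, only glued to a neighbouring token, only in decomposed form ("e" followed by a combining accent rather than a precomposed "é"), or not at all.

```python
import re
import unicodedata


def diagnose_word(text, word):
    """Classify how `word` occurs in `text` (hypothetical helper)."""
    nfc_word = unicodedata.normalize("NFC", word)
    nfc_text = unicodedata.normalize("NFC", text)
    # Match the word only when it is not surrounded by other word characters.
    pattern = re.compile(
        r"(?<!\w)" + re.escape(nfc_word) + r"(?!\w)", re.IGNORECASE
    )
    if pattern.search(text):
        return "standalone"
    if pattern.search(nfc_text):
        # Found only after composing "e" + combining accent into "é".
        return "standalone after NFC normalization"
    if nfc_word.lower() in nfc_text.lower():
        # Present, but glued to neighbouring letters (a lost word break).
        return "embedded in a longer token"
    return "missing"


def check_file(path, words):
    """Run diagnose_word for each word against a UTF-8 text file."""
    # A UnicodeDecodeError here would itself suggest that the extracted
    # text is not valid UTF-8.
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return {word: diagnose_word(text, word) for word in words}
```

You would call `check_file` with the path of the extracted text file reported for your record and the list `["ingeniería", "emérito"]`. A result other than "standalone" for "emérito" would point at the extraction rather than at accent treatment in the search. Note that the sketch treats quotes as word boundaries, so unusual leading or trailing quotes still need a look at the raw text.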

