Hi,
Thank you for your quick answer.
> Hi Dominic:
>
>On Wed, 06 Aug 2008, Dominic Lukas Wyler wrote:
>> When looking at the logs, the term list for most of the records
>> contains what I assume are escaped unicode characters, such as:
>>
>> 'l\xe2\x80\x99ere', 'd\xe2\x80\x99environ',
>> 'l\xe2\x80\x99immunofluorescence', '\x9clexique'
>
> These are not valid UTF-8; '\x9clexique' is a good example.
> Are these terms coming up when indexing the metadata or the full-text?
> If from a full-text file, is it PDF or some other format? E.g. if PDF,
> then pdftotext should produce quite a clean UTF-8 output ('-enc UTF-8').
> Can you send me your input file for testing? BTW, bibindex with verbose
> level 9 may be helpful to see what is going on.
So far I have only tested indexing on full-text files, mostly PDF and ps.gz. According to my tests,
the output of pdftotext is fine.
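A quick way to check whether an extracted full-text file is valid UTF-8 (a minimal sketch; the filename
is just an example):

def is_valid_utf8(filename):
    """Return True if the file content decodes cleanly as UTF-8."""
    try:
        unicode(open(filename).read(), 'utf-8')
        return True
    except UnicodeDecodeError:
        return False

print is_valid_utf8('fulltext.txt')   # e.g. a pdftotext output file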
>> an additional set of characters added to search_engine.py's accent
>> stripping and bibformat related changes). We also added a special
>> quote ('\'') to the separators used by bibindex.
>
> The problem may also be related to your changes in the accent stripping
> and the word breaking procedures, if your changes happen not to be fully
> UTF-8 safe. E.g. if they break a word in the middle of a multi-byte
> Unicode character. (This is why strip_accents() converts temporarily
> its UTF-8 binary string input into a Unicode string before doing the
> regexps.) You may want to check your edits from this point of view.
>
> Just wild-guessing what else may have gone wrong if it is not the first
> possibility ('badly asciified full-text file')...
>
> Best regards
The edits in search_engine for accent stripping involved adding support for ISO-8859-2 and ISO-8859-15
characters.
To do so, we defined a few additional regexps, for example:
[...]
re_unicode_uppercase_z = re.compile(unicode(r"(?u)[ŹŻŽ]", "utf-8"))
re_unicode_lowercase_z = re.compile(unicode(r"(?u)[źżž]", "utf-8"))
[...]
and then, in strip_accents(), added:
[...]
y = re_unicode_uppercase_z.sub("z", y)
y = re_unicode_lowercase_z.sub("z", y)
[...]
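A quick check that the new regexps behave as expected on Unicode input (a minimal sketch; the sample
word is arbitrary, and the input is decoded to Unicode first, as strip_accents() does):

# -*- coding: utf-8 -*-
import re

re_unicode_lowercase_z = re.compile(unicode(r"(?u)[źżž]", "utf-8"))

y = unicode("žurnál", "utf-8")          # decode the UTF-8 input to Unicode first
y = re_unicode_lowercase_z.sub("z", y)  # ž -> z
print y.encode("utf-8")                 # -> zurnál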
This seems to be working fine. The problem was actually caused by the additional quote we added to
bibindex's alphanumeric separators:
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS = \!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\’
This quote was actually breaking words in the middle of some multi-byte UTF-8 characters. The quote in
question is a 3-byte character, e.g. 'L’espace thérapeutique' gives 'L\xe2\x80\x99espace th\xc3\xa9rapeutique'.
Since the separator regexp operates on byte strings, its character class matches the bytes \xe2, \x80 and
\x99 individually, so it also splits other multi-byte characters that share those bytes (which is where terms
like '\x9clexique' come from). Removing that character from the list fixed the issue. Thank you very much
for your help.
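To illustrate (a minimal sketch of the byte-level behaviour, not the actual bibindex code):

import re

# Byte-string pattern: the class contains the bytes \xe2, \x80 and \x99,
# not the single character U+2019.
re_sep_bytes = re.compile("['\xe2\x80\x99]")

# U+201C (left double quote) is \xe2\x80\x9c: its first two bytes are
# consumed as separators, leaving the stray \x9c glued to the word.
print re_sep_bytes.split('\xe2\x80\x9clexique\xe2\x80\x9d')
# -> ['', '', '\x9clexique', '', '\x9d']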
But now, if I want to keep this character as a separator (many of our submitted documents contain such quotes),
I assume I have to proceed as was done for the accent stripping: have the current phrase in
bibindex_engine.get_words_from_phrase() in Unicode, as well as all the regexps?
get_words_from_phrase() would then extract the words from the phrase and convert them back to UTF-8?
So we would have, for example, the regexp converted to Unicode:
in bibindex_engine:
re_separators = re.compile(unicode(r"(?u)%s" % CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, "utf-8"))
and then, in get_words_from_phrase():
[...]
phrase = lower_index_term_unicode(phrase)
[...]
and encode each group back to UTF-8 before storing it in 'words':
words[alphanumeric_group.encode('utf-8')] = 1
with
def lower_index_term_unicode(term):
    return unicode(term, 'utf-8').lower()
which returns the lowered term as a Unicode string instead of UTF-8.
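Putting it all together, the sketch I have in mind (the separator value below is shortened and wrapped
in a character class just for illustration; the real get_words_from_phrase() does more than this):

# -*- coding: utf-8 -*-
import re

# Shortened, illustrative value; the real one comes from the config.
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS = r"[\!\"\'\,\.\:\;\?\’]"

re_separators = re.compile(unicode(r"(?u)%s" %
    CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, "utf-8"))

def lower_index_term_unicode(term):
    """Return the lowered term as a Unicode string, not UTF-8."""
    return unicode(term, 'utf-8').lower()

def get_words_from_phrase(phrase):
    """Split a UTF-8 phrase into words, working in Unicode throughout,
    and return the words re-encoded as UTF-8."""
    words = {}
    phrase = lower_index_term_unicode(phrase)
    for block in phrase.split():
        for alphanumeric_group in re_separators.split(block):
            if alphanumeric_group:
                words[alphanumeric_group.encode('utf-8')] = 1
    return words.keys()

print sorted(get_words_from_phrase('L\xe2\x80\x99espace th\xc3\xa9rapeutique'))
# -> ['espace', 'l', 'th\xc3\xa9rapeutique']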
I'll be doing a few tests with this, as it would be useful for us to be able to add such characters
as separators.
Thanks again,
Best regards
