Hi,
Thank you for your quick answer.
> Hi Dominic:
>
>On Wed, 06 Aug 2008, Dominic Lukas Wyler wrote:
>> When looking at the logs, the term list for most of the records
>> contains what I assume are escaped unicode characters, such as:
>>
>> 'l\xe2\x80\x99ere', 'd\xe2\x80\x99environ',
>> 'l\xe2\x80\x99immunofluorescence', '\x9clexique'
>
> These are not valid UTF-8; '\x9clexique' is a good example.
> Are these terms coming up when indexing the metadata or the full-text?
> If from a full-text file, is it PDF or some other format? E.g. if PDF,
> then pdftotext should produce quite a clean UTF-8 output ('-enc UTF-8').
> Can you send me your input file for testing? BTW, bibindex with verbose
> level 9 may be helpful to see what is going on.
So far I have only tested indexing on full-text files, mostly PDF and ps.gz. According to my tests,
the output of pdftotext is fine.
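A quick way to check whether an extracted full-text file is valid UTF-8 (a minimal sketch; the filename
is just an example):

def is_valid_utf8(filename):
    """Return True if the file content decodes cleanly as UTF-8."""
    try:
        unicode(open(filename).read(), 'utf-8')
        return True
    except UnicodeDecodeError:
        return False

print is_valid_utf8('fulltext.txt')   # e.g. a pdftotext output file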
>> an additional set of characters added to search_engine.py's accent
>> stripping and bibformat related changes). We also added a special
>> quote ('\'') to the separators used by bibindex.
>
> The problem may also be related to your changes in the accent stripping
> and the word breaking procedures, if your changes happen not to be fully
> UTF-8 safe. E.g. if they break a word in the middle of a multi-byte
> Unicode character. (This is why strip_accents() converts temporarily
> its UTF-8 binary string input into a Unicode string before doing the
> regexps.) You may want to check your edits from this point of view.
>
> Just wild-guessing what else may have gone wrong if it is not the first
> possibility ('badly asciified full-text file')...
>
> Best regards
The edits in search_engine for accent stripping involved adding support for ISO-8859-2 and ISO-8859-15
characters.
To do so, we defined a few additional regexps, for example:
[...]
re_unicode_uppercase_z = re.compile(unicode(r"(?u)[ŹŻŽ]", "utf-8"))
re_unicode_lowercase_z = re.compile(unicode(r"(?u)[źżž]", "utf-8"))
[...]
and then, in strip_accents(), added:
[...]
y = re_unicode_uppercase_z.sub("z", y)
y = re_unicode_lowercase_z.sub("z", y)
[...]
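A quick check that the new regexps behave as expected on Unicode input (a minimal sketch; the sample
word is arbitrary, and the input is decoded to Unicode first, as strip_accents() does):

# -*- coding: utf-8 -*-
import re

re_unicode_lowercase_z = re.compile(unicode(r"(?u)[źżž]", "utf-8"))

y = unicode("žurnál", "utf-8")          # decode the UTF-8 input to Unicode first
y = re_unicode_lowercase_z.sub("z", y)  # ž -> z
print y.encode("utf-8")                 # -> zurnál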
This seems to be working fine. The problem was actually caused by the additional quote we added to
bibindex's alphanumeric separators:
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS = \!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\’
This quote was actually breaking words in the middle of some multi-byte UTF-8 characters. The quote in
question is a 3-byte character, e.g. 'L’espace thérapeutique' gives 'L\xe2\x80\x99espace th\xc3\xa9rapeutique'.
Since the separator regexp operates on byte strings, its character class matches the bytes \xe2, \x80 and
\x99 individually, so it also splits other multi-byte characters that share those bytes (which is where terms
like '\x9clexique' come from). Removing that character from the list fixed the issue. Thank you very much
for your help.
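To illustrate (a minimal sketch of the byte-level behaviour, not the actual bibindex code):

import re

# Byte-string pattern: the class contains the bytes \xe2, \x80 and \x99,
# not the single character U+2019.
re_sep_bytes = re.compile("['\xe2\x80\x99]")

# U+201C (left double quote) is \xe2\x80\x9c: its first two bytes are
# consumed as separators, leaving the stray \x9c glued to the word.
print re_sep_bytes.split('\xe2\x80\x9clexique\xe2\x80\x9d')
# -> ['', '', '\x9clexique', '', '\x9d']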
But now, if I want to keep this character as a separator (many of our submitted documents contain such quotes),
I assume I have to proceed as was done for the accent stripping: have the current phrase in
bibindex_engine.get_words_from_phrase() in Unicode, as well as all the regexps?
get_words_from_phrase() would then extract the words from the phrase and convert them back to UTF-8?
So we would have, for example, the regexp converted to Unicode:
in bibindex_engine:
re_separators = re.compile(unicode(r"(?u)%s" % CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, "utf-8"))
and then, in get_words_from_phrase():
[...]
phrase = lower_index_term_unicode(phrase)
[...]
and encode each group back to UTF-8 before storing it in 'words':
words[alphanumeric_group.encode('utf-8')] = 1
with
def lower_index_term_unicode(term):
    return unicode(term, 'utf-8').lower()
which returns the lowered term as a Unicode string instead of UTF-8.
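Putting it all together, the sketch I have in mind (the separator value below is shortened and wrapped
in a character class just for illustration; the real get_words_from_phrase() does more than this):

# -*- coding: utf-8 -*-
import re

# Shortened, illustrative value; the real one comes from the config.
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS = r"[\!\"\'\,\.\:\;\?\’]"

re_separators = re.compile(unicode(r"(?u)%s" %
    CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, "utf-8"))

def lower_index_term_unicode(term):
    """Return the lowered term as a Unicode string, not UTF-8."""
    return unicode(term, 'utf-8').lower()

def get_words_from_phrase(phrase):
    """Split a UTF-8 phrase into words, working in Unicode throughout,
    and return the words re-encoded as UTF-8."""
    words = {}
    phrase = lower_index_term_unicode(phrase)
    for block in phrase.split():
        for alphanumeric_group in re_separators.split(block):
            if alphanumeric_group:
                words[alphanumeric_group.encode('utf-8')] = 1
    return words.keys()

print sorted(get_words_from_phrase('L\xe2\x80\x99espace th\xc3\xa9rapeutique'))
# -> ['espace', 'l', 'th\xc3\xa9rapeutique']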
I'll be doing a few tests with this, as it would be useful for us to be able to add such characters
as separators.
Thanks again,
Best regards
