Re: [Invenio] #748: wash_for_utf8 breaks sometimes when fed an already-unicode string

Henning Weiler Wed, 07 Sep 2011 01:11:39 -0700

Dear all,

>> The place this was being called from, though, was
>> perform_request_search.  Sam and Henning didn't call the washer
>> directly; they are taking strings out of the database and handing them
>> to p_r_s directly without intermediation.
> 
> So the question is how come that Unicode string u'G\xf6ppert' was passed
> to p_r_s() in such a format, when it is known that p_r_s() accepts only
> binary strings in the first place?  u'G\xf6ppert' cannot be read from
> the database in this format, because run_sql() returns binary strings,
> not Unicode strings.  So there must have been some man-in-the-middle
> component that massaged the object returned by the database into a
> Unicode string.  (Or a component that loaded serialised blob that
> contained Unicode strings, or something.)  This is why I'm trying to
> understand the `root cause' of the problem, because the component in
> question should know that p_r_s() is callable with binary strings only,
> and so it should not pass Unicode strings instead.  Do you still have
> the original exception with stack frame details so that we could see
> which part of the code was involved here?
> 
> (Because if there was no intermediation, as you said, then this would
> mean that MySQLdb somehow returned Unicode strings out of the database,
> in spite of dbquery's use of `use_unicode=False', which would indicate a
> deeper problem on the MySQLdb level.  This would be surprising.)


Personally, I think it is our fault! Apologies. We use Unicode strings
throughout bibauthorid, because we do massive string comparisons and
string operations in many places, and some of these are done by tools
that only support Unicode input.

To ensure that we always work with Unicode strings, we decode early and
encode late. "Early" means right when we retrieve data from the db:
.decode('utf-8'). "Late" means when we are about to write back to the
db: .encode('utf-8').
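
A minimal sketch of this decode-early, encode-late pattern follows. The
helper names from_db() and to_db() are hypothetical and are not taken
from bibauthorid. The sketch uses Python 3 naming, where `bytes` plays
the role of the binary string and `str` the role of the Unicode string.

```python
def from_db(raw):
    """Decode early: turn a UTF-8 binary string read from the db into text."""
    if isinstance(raw, bytes):
        return raw.decode('utf-8')
    return raw  # already text, leave untouched


def to_db(text):
    """Encode late: turn text back into a UTF-8 binary string for the db."""
    if isinstance(text, bytes):
        return text  # already binary, leave untouched
    return text.encode('utf-8')


# Round trip for the name discussed in this thread.
name = from_db(b'G\xc3\xb6ppert')   # text, safe for Unicode-only tools
stored = to_db(name)                # binary again, safe for the db
```

Anything that leaves this Unicode-only zone without passing through
to_db() is still a Unicode string, which is what happened with our
calls to p_r_s().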

>> Well, as far as I can tell, all of the callers are doing things that
>> seem reasonable, it's just our data is dirty.
> 
> Not sure about this.  OT1H, if our binary data is dirty, then this
> should be already properly taken care of by the current implementation
> of wash_for_utf8(), which turns dirty binary strings into valid UTF-8
> binary strings.  It would be no help to enrich wash_for_utf8() function
> with Unicode type checking here.  OTOH, strings like u'G\xf6ppert' are
> not dirty, but perfectly valid Unicode strings.  So it seems the problem
> is not due to some broken dirty data, but rather due to internal data
> mishandling, such as passing Unicode strings where binary strings were
> expected, and vice versa.

So it certainly is bibauthorid that wrongly calls p_r_s() with Unicode
strings. Unfortunately, the documentation of p_r_s() does not explicitly
warn about this issue and does not specify which string type (binary or
Unicode) its parameters must have. We might want to either enrich the
docstring with a warning, or train p_r_s() to correctly handle
already-Unicode strings. Please let us know if we should change our
callers of p_r_s().
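
If the fix belongs on the caller side, one possible shape is a small
guard that encodes any Unicode arguments to UTF-8 before the call. This
is only a sketch under assumptions: utf8_bytes_only() and fake_search()
are hypothetical names, fake_search() merely stands in for a
binary-only function such as p_r_s(), and the code uses Python 3 naming
(`str` for Unicode strings, `bytes` for binary strings).

```python
import functools


def utf8_bytes_only(func):
    """Encode text arguments to UTF-8 binary strings before calling func."""
    def _encode(value):
        # Only text is converted; binary strings and other types pass through.
        return value.encode('utf-8') if isinstance(value, str) else value

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        args = [_encode(a) for a in args]
        kwargs = {k: _encode(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)
    return wrapper


@utf8_bytes_only
def fake_search(p=b'', f=b''):
    # Hypothetical stand-in for a binary-only search call;
    # it just returns what it received.
    return (p, f)


result = fake_search(p='G\xf6ppert', f=b'author')
```

With such a guard, the Unicode name is converted at the boundary and
the callee only ever sees binary strings.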

Cheers,
 Henning
