On Mon, 15 Aug 2011, Joe Blaylock wrote:
>>  In these cases, we may try to do what we can with these ill-defined
>>  input strings to "wash away" problematic bytes from UTF-8 binary
>>  string, while retaining the good ones, so that the incoming string
>>  would be as untouched as possible while being fully UTF-8 valid.
>>  This is what `wash_for_utf8()` function attempts to do.
>
> Isn't this what unicode(str, 'utf8', errors='ignore') does?

Yes, it is, and I think we can safely switch to such an implementation,
unless there are some other use cases that Piotr had in mind with his
implementation?
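For reference, here is a minimal sketch of that idea, written in Python 3 notation (where `bytes` plays the role of the Python 2 binary `str`; the Python 2 spelling would be `unicode(text, 'utf8', errors='ignore')`).  The function name mirrors the existing `wash_for_utf8()`, but the body is only an illustration, not the actual Invenio implementation:

```python
def wash_for_utf8(text):
    """Drop any bytes that do not form valid UTF-8, keep the rest."""
    # errors='ignore' silently discards invalid byte sequences, so the
    # result is always a valid UTF-8 byte string, as untouched as possible.
    return text.decode('utf-8', errors='ignore').encode('utf-8')

# The Latin-1 byte 0xf6 is not valid UTF-8, so it gets washed away:
wash_for_utf8(b'G\xf6ppert')  # -> b'Gppert'
```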

> The place this was being called from, though, was
> perform_request_search.  Sam and Henning didn't call the washer
> directly; they are taking strings out of the database and handing them
> to p_r_s directly without intermediation.

So the question is how the Unicode string u'G\xf6ppert' came to be
passed to p_r_s() in this format, when it is known that p_r_s() accepts
only binary strings in the first place.  u'G\xf6ppert' cannot have been
read from the database in this format, because run_sql() returns binary
strings, not Unicode strings.  So there must have been some
man-in-the-middle component that massaged the object returned by the
database into a Unicode string.  (Or a component that loaded a
serialised blob containing Unicode strings, or something.)  This is why
I'm trying to
understand the `root cause' of the problem, because the component in
question should know that p_r_s() is callable with binary strings only,
and so it should not pass Unicode strings instead.  Do you still have
the original exception with stack frame details so that we could see
which part of the code was involved here?

(Because if there was no intermediation, as you said, then this would
mean that MySQLdb somehow returned Unicode strings out of the database,
in spite of dbquery's use of `use_unicode=False', which would indicate a
deeper problem on the MySQLdb level.  This would be surprising.)

> Couldn't we just change the name of wash_for_utf8() to give_me_utf8()
> and change the
> {{{
>         return text
> }}}
> on line 352 of my branch to
> {{{
>         return text.encode('utf8')
> }}}
> ?

Yes, that's the possibility I mentioned in ticket:748#comment:6 under
the `converting' label.  Things would be clean that way, and I'd be
perfectly happy to merge it like that.  As I mentioned during one of our
previous telecons, I was originally going to amend the patch, but then I
stopped and preferred to ask you for details, because it's valuable to
make triple sure that we don't mask some deeper double-encoding troubles
or something.  It seems we are not masking anything; we are merely
passing Unicode strings to p_r_s() when we shouldn't, which is exactly
what the `convert' way of doing things would take care of.  So unless
concrete tracebacks (that I've not seen) indicate
otherwise, or unless you see something else in the picture, I'll merge
your patch in this `convert' way, and we should be OK.
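Concretely, the merged `convert' variant could look something like the following sketch (Python 3 notation again; `give_me_utf8` is the name you proposed above, and the body is an illustration, not the final patch):

```python
def give_me_utf8(text):
    """Return a valid UTF-8 byte string for either text or bytes input."""
    if isinstance(text, str):  # Python 2: isinstance(text, unicode)
        # A valid Unicode string: encoding it to UTF-8 is lossless.
        return text.encode('utf-8')
    # A binary string: wash away any invalid UTF-8 bytes as before.
    return text.decode('utf-8', errors='ignore').encode('utf-8')

give_me_utf8('G\xf6ppert')  # -> b'G\xc3\xb6ppert', safe to pass to p_r_s()
```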

> Well, as far as I can tell, all of the callers are doing things that
> seem reasonable, it's just our data is dirty.

Not sure about this.  OT1H, if our binary data is dirty, then this
should already be properly taken care of by the current implementation
of wash_for_utf8(), which turns dirty binary strings into valid UTF-8
binary strings.  Enriching the wash_for_utf8() function with Unicode
type checking would be no help here.  OTOH, strings like u'G\xf6ppert' are
not dirty, but perfectly valid Unicode strings.  So it seems the problem
is not due to some broken dirty data, but rather due to internal data
mishandling, such as passing Unicode strings where binary strings were
expected, and vice versa.
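To make the distinction concrete (Python 3 notation, illustrative sketch only): the same name behaves quite differently depending on whether it arrives as dirty bytes or as a valid Unicode string:

```python
dirty_bytes = b'G\xf6ppert'  # Latin-1 encoded: byte 0xf6 is invalid UTF-8
valid_text = 'G\xf6ppert'    # a perfectly valid Unicode string

# Washing the dirty bytes loses the character for good:
washed = dirty_bytes.decode('utf-8', errors='ignore')  # 'Gppert'

# Encoding the valid Unicode string is lossless:
encoded = valid_text.encode('utf-8')                   # b'G\xc3\xb6ppert'
```

So washing is the right tool for genuinely dirty data, while valid Unicode strings only need (lossless) encoding before being handed to p_r_s().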

Best regards
-- 
Tibor Simko
