#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
Reporter: jblayloc | Owner: jlavik
Type: defect | Status: assigned
Priority: blocker | Milestone:
Component: MiscUtil | Version:
Resolution: | Keywords: INSPIRE DEPLOYED
-----------------------+------------------------------
Changes (by simko):
* priority: major => blocker
* status: in_merge => assigned
Comment:
Actually, it is not correct to amend the docstring of
`wash_for_utf8()` to say that it returns UTF-8 encoded string with
incorrect characters stripped, while at the same time to amend it to
return Unicode strings unchanged. Looking at the function, it assumes
that its input parameter `text` is UTF-8 binary string already, so it
simply checks for compliance and removes any extra binary characters.
The function would fail should it receive UTF-16 encoded binary
string, for example. So we should amend either the docstring or the
behaviour of the function to cover these non-UTF-8 input cases better.
For example, we could amend it to accept either UTF-8 binary strings
or Unicode strings (so UTF-16 would still lead to error), while it
would return UTF-8 binary strings in both input cases. But then it
would not be only "washing", it would be also "converting"...
Moreover, calling this function for a Unicode input string, while the
function usually expects UTF-8 binary string, may lead to masking
troubles with encode/decode in our code base. (Kind of similar to
calling `cgi.escape(x)` for an already escaped argument.) One can
argue that since this function is used to wash UTF-8 binary strings,
why should it be called for Unicode strings in the first place?
It is therefore possible that the problem at hand is better tackled at
the source place somewhere else, not at the level of this UTF-8
washing. Why exactly was this change needed?
(Putting the ticket status back to "assigned", and raising its
priority to "blocker", since this patch was already deployed on
INSPIRE.)
--
Ticket URL: <http://invenio-software.org/ticket/748#comment:6>
Invenio <http://invenio-software.org>