#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
Reporter: jblayloc | Owner: jblayloc
Type: defect | Status: assigned
Priority: blocker | Milestone:
Component: MiscUtil | Version:
Resolution: | Keywords: INSPIRE DEPLOYED
-----------------------+------------------------------
Comment (by simko):
Replying to [comment:8 jblayloc]:
> I don't understand your question. As I understand it, all UTF8 strings
are Unicode strings, but not vice-versa.
Let me elaborate then. In Python 2, there are two distinct types of
strings, so to speak, Unicode (character) strings and binary (byte)
strings. In Invenio, we are using typically UTF-8 binary strings in
I/O tiers: they get stored into DB, they get read from or written into
the web interface, etc. When necessary, we transform them into
Unicode strings inside the application business logic for
manipulations that require Unicode string objects instead of binary
string objects. For an example, see `wash_index_term()`.
Now the reason for the existence of `wash_for_utf8()` is that
sometimes the UTF-8 binary strings at the I/O times may be corrupted.
For example, maybe our call to `pdftotext` did not return fully UTF-8
valid output and there was some gibberish. In these cases, we may try
to do what we can with these ill-defined input strings to "wash away"
problematic bytes from UTF-8 binary string, while retaining the good
ones, so that the incoming string would be as untouched as possible
while being fully UTF-8 valid. This is what `wash_for_utf8()`
function attempts to do.
From this point of view, it is not necessary to amend this function so
that it would work on Unicode strings too -- simply because Unicode
strings can be converted into valid UTF-8 binary strings anytime by
`encode('utf-8')`. It should not be needed to wash anything in these
cases, since our business logic works with valid Unicode strings
already.
Hence my "masking worries", notably that some component of the Invenio
business logic probably calls UTF-8 washing function in an
inappropriate way, akin to escaping an already escaped argument, so to
speak. If this part of the Invenio application already transformed
UTF-8 binary string into Unicode string, then why is there a need to
call `wash_for_utf8()`? And if this part of the application tries to
perform some I/O washing upon Unicode strings, then shouldn't these
have been converted into UTF-8 binary strings previously? Trying to
amend `wash_for_utf8()` to accept also Unicode string arguments could
"mask" problems of the kind that some code base component expects
UTF-8 binary string as its input, but some other component passes it
Unicode strings instead. We should avoid cases like that.
(Alternatively, and perhaps ideally, we could work everywhere in our
application logic code base with Unicode strings only, and transform
them to UTF-8 binary strings at the I/O times only, but this is not
the state of the things right now.)
So this is why I wanted to understand why this change was needed and
whether we shouldn't rather try to isolate the part of the code that
was apparently calling for this amendment of the UTF-8 washing
function, in order to check whether we shouldn't solve the issue "at
the source component" level and not inside `wash_for_utf8()` level.
The `wash_for_utf8()` function was designed primarily to help with
washing of potentially bad UTF-8 input/output values; it should not be
needed to work with Unicode string objects.
> The remainder of the patch after the first two lines is (strictly
speaking) unnecessary optimization because Piotr was offended by the
inefficiency of what he'd written and so sat with me and rewrote the
function before he'd let me deploy it anywhere.
We can commit this part independently as an optimisation :)
--
Ticket URL: <http://invenio-software.org/ticket/748#comment:9>
Invenio <http://invenio-software.org>