#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
  Reporter:  jblayloc  |      Owner:  jblayloc
      Type:  defect    |     Status:  assigned
  Priority:  blocker   |  Milestone:
 Component:  MiscUtil  |    Version:
Resolution:            |   Keywords:  INSPIRE DEPLOYED
-----------------------+------------------------------

Comment (by simko):

 Replying to [comment:8 jblayloc]:
 > I don't understand your question. As I understand it, all UTF8 strings
 are Unicode strings, but not vice-versa.

 Let me elaborate then.  In Python 2, there are two distinct types of
 strings, so to speak, Unicode (character) strings and binary (byte)
 strings.  In Invenio, we are using typically UTF-8 binary strings in
 I/O tiers: they get stored into DB, they get read from or written into
 the web interface, etc.  When necessary, we transform them into
 Unicode strings inside the application business logic for
 manipulations that require Unicode string objects instead of binary
 string objects.  For an example, see `wash_index_term()`.

 Now the reason for the existence of `wash_for_utf8()` is that
 sometimes the UTF-8 binary strings at the I/O times may be corrupted.
 For example, maybe our call to `pdftotext` did not return fully UTF-8
 valid output and there was some gibberish.  In these cases, we may try
 to do what we can with these ill-defined input strings to "wash away"
 problematic bytes from UTF-8 binary string, while retaining the good
 ones, so that the incoming string would be as untouched as possible
 while being fully UTF-8 valid.  This is what `wash_for_utf8()`
 function attempts to do.

 From this point of view, it is not necessary to amend this function so
 that it would work on Unicode strings too -- simply because Unicode
 strings can be converted into valid UTF-8 binary strings anytime by
 `encode('utf-8')`.  It should not be needed to wash anything in these
 cases, since our business logic works with valid Unicode strings
 already.

 Hence my "masking worries", notably that some component of the Invenio
 business logic probably calls UTF-8 washing function in an
 inappropriate way, akin to escaping an already escaped argument, so to
 speak.  If this part of the Invenio application already transformed
 UTF-8 binary string into Unicode string, then why is there a need to
 call `wash_for_utf8()`?  And if this part of the application tries to
 perform some I/O washing upon Unicode strings, then shouldn't these
 have been converted into UTF-8 binary strings previously?  Trying to
 amend `wash_for_utf8()` to accept also Unicode string arguments could
 "mask" problems of the kind that some code base component expects
 UTF-8 binary string as its input, but some other component passes it
 Unicode strings instead.  We should avoid cases like that.
 (Alternatively, and perhaps ideally, we could work everywhere in our
 application logic code base with Unicode strings only, and transform
 them to UTF-8 binary strings at the I/O times only, but this is not
 the state of the things right now.)

 So this is why I wanted to understand why this change was needed and
 whether we shouldn't rather try to isolate the part of the code that
 was apparently calling for this amendment of the UTF-8 washing
 function, in order to check whether we shouldn't solve the issue "at
 the source component" level and not inside `wash_for_utf8()` level.
 The `wash_for_utf8()` function was designed primarily to help with
 washing of potentially bad UTF-8 input/output values; it should not be
 needed to work with Unicode string objects.

 > The remainder of the patch after the first two lines is (strictly
 speaking) unnecessary optimization because Piotr was offended by the
 inefficiency of what he'd written and so sat with me and rewrote the
 function before he'd let me deploy it anywhere.

 We can commit this part independently as an optimisation :)

-- 
Ticket URL: <http://invenio-software.org/ticket/748#comment:9>
Invenio <http://invenio-software.org>

Reply via email to