#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
  Reporter:  jblayloc  |      Owner:  jlavik
      Type:  defect    |     Status:  assigned
  Priority:  blocker   |  Milestone:
 Component:  MiscUtil  |    Version:
Resolution:            |   Keywords:  INSPIRE DEPLOYED
-----------------------+------------------------------
Changes (by simko):

 * priority:  major => blocker
 * status:  in_merge => assigned


Comment:

 Actually, it is not correct to amend the docstring of
 `wash_for_utf8()` to say that it returns UTF-8 encoded string with
 incorrect characters stripped, while at the same time to amend it to
 return Unicode strings unchanged.  Looking at the function, it assumes
 that its input parameter `text` is UTF-8 binary string already, so it
 simply checks for compliance and removes any extra binary characters.
 The function would fail should it receive UTF-16 encoded binary
 string, for example.  So we should amend either the docstring or the
 behaviour of the function to cover these non-UTF-8 input cases better.
 For example, we could amend it to accept either UTF-8 binary strings
 or Unicode strings (so UTF-16 would still lead to error), while it
 would return UTF-8 binary strings in both input cases.  But then it
 would not be only "washing", it would be also "converting"...

 Moreover, calling this function for a Unicode input string, while the
 function usually expects UTF-8 binary string, may lead to masking
 troubles with encode/decode in our code base.  (Kind of similar to
 calling `cgi.escape(x)` for an already escaped argument.)  One can
 argue that since this function is used to wash UTF-8 binary strings,
 why should it be called for Unicode strings in the first place?

 It is therefore possible that the problem at hand is better tackled at
 the source place somewhere else, not at the level of this UTF-8
 washing.  Why exactly was this change needed?

 (Putting the ticket status back to "assigned", and raising its
 priority to "blocker", since this patch was already deployed on
 INSPIRE.)

-- 
Ticket URL: <http://invenio-software.org/ticket/748#comment:6>
Invenio <http://invenio-software.org>

Reply via email to