#559: BibUpload: Cannot bibupload file containing UTF-8 chars
--------------------------+------------------------
  Reporter:  grfavre      |      Owner:
      Type:  enhancement  |     Status:  new
  Priority:  major        |  Milestone:  v1.0
 Component:  WebSubmit    |    Version:
Resolution:               |   Keywords:  bibdocfile
--------------------------+------------------------
Changes (by grfavre):

 * keywords:   => bibdocfile
 * priority:  critical => major
 * component:  BibUpload => WebSubmit
 * type:  defect => enhancement


Comment:

 I finally found the solution.
 For some reason, bibupload goes through bibdocfile: it adds comments and
 descriptions to the MARC using the get_description() and get_comment()
 functions. These functions retrieve content that was pickled into a blob
 in the database (this is really bad design, sorry guys: a database blob is
 by no means meant to contain language-specific pickled objects).
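
 To make it clear why the unicode object survives untouched, here is a
 conceptual sketch (the dict layout below is made up, not the real MoreInfo
 format): pickle round-trips whatever Python object was stored.

 {{{
 # Conceptual sketch only (not the real MoreInfo layout): pickle gives back
 # exactly the object that was stored, so a unicode description put into
 # the blob comes back as a unicode object, not as an encoded str.
 import cPickle

 blob = cPickle.dumps({'description': u'r\xe9sum\xe9'})   # what set_description() stored
 restored = cPickle.loads(blob)['description']            # what get_description() returns
 print type(restored)                                      # <type 'unicode'>
 }}}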

 As no encoding is applied to the content initially passed to set_comment()
 or set_description(), building the MARC later crashes if that content was
 a unicode object rather than a UTF-8 encoded string.
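
 A minimal Python 2 illustration of the failure mode (the variable names
 are made up, not the actual bibupload code): as soon as an unpickled
 unicode description meets a UTF-8 encoded byte string while the MARC is
 being assembled, Python attempts an implicit ASCII decode and crashes.

 {{{
 # Illustrative only: mixing a unicode object with a UTF-8 encoded byte
 # string makes Python 2 try an implicit ASCII decode of the bytes,
 # which fails on any non-ASCII character.
 encoded_field = 'caf\xc3\xa9'     # UTF-8 byte string, e.g. another MARC field
 description = u'r\xe9sum\xe9'     # unicode object as returned by get_description()

 try:
     marc_snippet = encoded_field + description
 except UnicodeDecodeError, err:
     print 'crash while building MARC:', err
 }}}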

 The solution I used was to re-encode all descriptions:
 {{{
 from invenio.dbquery import run_sql
 from invenio.bibdocfile import BibRecDocs

 # All records that have at least one attached bibdoc.
 recids = run_sql("SELECT DISTINCT id_bibrec FROM bibrec_bibdoc")

 def stringize(str_like, default='n/a'):
     """Return str_like as a UTF-8 encoded str; None becomes `default`."""
     if isinstance(str_like, str):
         return str_like
     if isinstance(str_like, unicode):
         return str_like.encode('utf-8')
     if str_like is None:
         return default
     raise ValueError('expected str, unicode or None, got %r' % type(str_like))

 # Re-store every description so that unicode objects pickled in the
 # MoreInfo blob are replaced by plain UTF-8 encoded strings.
 for (recid,) in recids:
     archive = BibRecDocs(recid)
     for bibdoc in archive.bibdocs:
         for bfile in bibdoc.list_all_files():
             description = stringize(bfile.get_description())
             bibdoc.set_description(description, bfile.get_format(),
                                    bfile.get_version())
 }}}
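
 The comments are pickled in the same blob, so a second pass may be needed
 for them as well. A sketch, assuming set_comment() takes the same
 (value, format, version) arguments as set_description() (I only verified
 the description case); it reuses recids and stringize() from the script
 above:

 {{{
 # Same re-encoding pass for comments, skipping files that have none.
 # Assumes set_comment() mirrors the signature of set_description().
 for (recid,) in recids:
     archive = BibRecDocs(recid)
     for bibdoc in archive.bibdocs:
         for bfile in bibdoc.list_all_files():
             comment = bfile.get_comment()
             if comment is not None:
                 bibdoc.set_comment(stringize(comment), bfile.get_format(),
                                    bfile.get_version())
 }}}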

 The simplest solution would be to check string-like values before storing
 them in the database: modify BibDocMoreInfo in bibdocfile so that it
 encodes content to UTF-8 before pickling and storing it.
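
 As a rough sketch of that guard (the helper name is hypothetical, not the
 current BibDocMoreInfo API), every value would be normalised to a UTF-8
 byte string before being pickled into the blob:

 {{{
 # Hypothetical guard for BibDocMoreInfo: normalise values to UTF-8 byte
 # strings before pickling, so unicode objects never reach the blob.
 def _ensure_utf8(value):
     """Return value as a UTF-8 encoded str; pass None through unchanged."""
     if isinstance(value, unicode):
         return value.encode('utf-8')
     if value is None or isinstance(value, str):
         return value
     raise TypeError('expected str, unicode or None, got %r' % type(value))

 # e.g. at the top of BibDocMoreInfo.set_description() / set_comment():
 #     description = _ensure_utf8(description)
 #     ...pickle and store as before...
 }}}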

 This problem would never have happened if the values were stored in a
 plain SQL field (whose encoding the database already handles). The best
 possible solution would therefore be to store such content directly in the
 tables; that effort would, however, cost somewhat more development time
 (modifications to the API, tests and migration kits)...

-- 
Ticket URL: <http://invenio-software.org/ticket/559#comment:12>
Invenio <http://invenio-software.org>
