#640: WebSubmit (plugin wsm_pdftk_plugin) - loss of metadata while it's being
processed
-------------------------------------------+-----------------
 Reporter:  jpcorral                       |      Owner:
     Type:  defect                         |     Status:  new
 Priority:  major                          |  Milestone:
Component:  WebSubmit                      |    Version:
 Keywords:  WebSubmit plugin pdf metadata  |
-------------------------------------------+-----------------
 This bug appears when the metadata of a PDF contains more than one "custom
 key" with the same name. A "custom key" is any key other than "InfoKey" or
 "InfoValue".

 Take the PDF of these e-proceedings, http://cdsweb.cern.ch/record/1090859,
 as an example. It has a long table of contents. If the plugin is used to
 extract the metadata:
 {{{
 from invenio.websubmit_file_metadata_plugins import wsm_pdftk_plugin
 # The second parameter is mandatory but is never used
 wsm_pdftk_plugin.read_metadata_local('/path/to/the/file/care-conf-06-049.pdf', 0)
 }}}

 The following information is returned:
 {{{
 {'BookmarkLevel': '1',
  'BookmarkPageNumber': '297',
  'BookmarkTitle': 'Session 8b_McIntyre_slides4.pdf',
  'CreationDate': "D:20070312163531+01'00'",
  'Creator': 'Adobe Acrobat 7.0',
  'ModDate': "D:20070314164628+01'00'",
  'NumberOfPages': '305',
  'PdfID0': '35ef8d4d0af11db8788011242e3266',
  'PdfID1': '2c54896bd24311db8788011242e3266',
  'Producer': 'Mac OS X 10.4.8 Quartz PDFContext'}
 }}}

 But if the command
 {{{pdftk /path/to/the/file/care-conf-06-049.pdf dump_data | less}}}
 is used, all of the metadata is extracted.

 These "custom keys" are stored as keys of a dictionary (line 98 of the
 plugin), so the value of each key is overwritten whenever a later line with
 the same key appears.
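
 The overwriting can be reproduced outside the plugin. The following is a
 minimal sketch, not the plugin's actual code: `parse_into_dict` and
 `SAMPLE_DUMP` are hypothetical names, and the sample lines only imitate the
 "Key: value" shape of the pdftk output (the bookmark titles and page numbers
 are made up).

```python
# Illustrative input only: made-up lines in the "Key: value" shape of the
# pdftk output, with the same bookmark keys repeated for two entries.
SAMPLE_DUMP = """\
NumberOfPages: 305
BookmarkTitle: First talk
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkTitle: Second talk
BookmarkLevel: 1
BookmarkPageNumber: 42
"""


def parse_into_dict(text):
    """Hypothetical parser: store every "Key: value" line in one plain dict."""
    metadata = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            # A repeated key silently replaces the value stored earlier.
            metadata[key] = value
    return metadata


metadata = parse_into_dict(SAMPLE_DUMP)
print(metadata["BookmarkTitle"])  # only the last bookmark title survives
```

 With this input the dictionary ends up with a single "BookmarkTitle" entry,
 which is the behaviour described above.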

 In this example, only the last entry of the table of contents of this PDF
 is retrieved:
 {{{
 {'BookmarkLevel': '1',
  'BookmarkPageNumber': '297',
  'BookmarkTitle': 'Session 8b_McIntyre_slides4.pdf',
  [...]
 }
 }}}
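
 One possible way to keep every entry is to collect the bookmark keys into a
 list of records instead of a single dictionary. This is only a sketch under
 the assumption that each bookmark starts with a "BookmarkTitle" line followed
 by its other "Bookmark..." lines; `group_bookmarks` is a hypothetical helper,
 not part of the plugin, and the sample lines are made up.

```python
def group_bookmarks(lines):
    """Collect "Bookmark..." lines into one dict per bookmark.

    Assumption: a "BookmarkTitle" line marks the start of a new bookmark.
    """
    bookmarks = []
    current = None
    for line in lines:
        key, sep, value = line.strip().partition(": ")
        if not sep or not key.startswith("Bookmark"):
            continue  # not a bookmark line
        if key == "BookmarkTitle" or current is None:
            current = {}
            bookmarks.append(current)
        current[key] = value
    return bookmarks


# Made-up sample lines, for illustration only.
sample_lines = [
    "NumberOfPages: 305",
    "BookmarkTitle: First talk",
    "BookmarkLevel: 1",
    "BookmarkPageNumber: 1",
    "BookmarkTitle: Second talk",
    "BookmarkLevel: 1",
    "BookmarkPageNumber: 42",
]

for bookmark in group_bookmarks(sample_lines):
    print(bookmark)
```

 Keys that appear only once, such as "NumberOfPages", could still be stored
 in the existing dictionary; only the repeated keys need a list.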

-- 
Ticket URL: <http://invenio-software.org/ticket/640>
Invenio <http://invenio-software.org>
