[jira] [Created] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded data

Andrew Jackson (JIRA) Tue, 14 Jul 2015 08:27:37 -0700

Andrew Jackson created TIKA-1678:
------------------------------------

             Summary: PDF metadata extraction fails to spot UTF-16 encoded data
                 Key: TIKA-1678
                 URL: https://issues.apache.org/jira/browse/TIKA-1678
             Project: Tika
          Issue Type: Bug
          Components: metadata
    Affects Versions: 1.9
            Reporter: Andrew Jackson
            Priority: Minor



When extracting metadata from PDFs, we see some odd behaviour for a minority of 
the documents. The PDF metadata can be encoded as UTF-16 octets, but is not 
always being decoded as such.

A specific example is here: 
http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf

Which contains this (literal file content):

{noformat}
443 0 obj
<</Type/Metadata
/Subtype/XML/Length 1978>>stream
<?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
\000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
\000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
\000E\000d\000i\000t\000i\000o\000n'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
<xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
<rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
<rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
xmlns:dc='http://purl.org/dc/elements/1.1/' 
dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
</rdf:RDF>
</x:xmpmeta>


<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 \000P\000D\000F\000 
\000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
\000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000 
\000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
\000E\000d\000i\000t\000i\000o\000n)
/CreationDate(D:20120718153801+01'00')
/ModDate(D:20120718153801+01'00')
/Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
/Author(\376\377\000T\000e\000t\000t\000i)>>endobj
{noformat} 

Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an error, 
but the ones encoded in the actual PDF metadata fields should be extracted 
accurately.
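
To illustrate the decoding I would expect for the Info dictionary fields, here 
is a minimal sketch in Python. It is not Tika or PDFBox code; the function 
names are made up for this example, and the Latin-1 fallback is only a rough 
stand-in for PDFDocEncoding. It undoes the literal-string escapes (octal 
sequences such as \376) and then decodes as UTF-16BE when the bytes start with 
the FE FF byte-order mark.

```python
import re

# Single-character escapes defined for PDF literal strings.
_SIMPLE_ESCAPES = {
    "n": b"\n", "r": b"\r", "t": b"\t", "b": b"\b", "f": b"\f",
    "(": b"(", ")": b")", "\\": b"\\",
}

# A backslash followed by 1-3 octal digits, a line ending, or any one character.
_ESCAPE = re.compile(r"\\([0-7]{1,3}|\r\n|\r|\n|.)", re.DOTALL)


def unescape_pdf_literal(raw: str) -> bytes:
    """Turn the escaped body of a PDF literal string into raw bytes."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(raw):
        # Copy the unescaped text between the previous escape and this one.
        out += raw[pos:m.start()].encode("latin-1")
        esc = m.group(1)
        if esc[0] in "01234567":
            out.append(int(esc, 8) & 0xFF)   # octal escape, e.g. \376 -> 0xFE
        elif esc in ("\r\n", "\r", "\n"):
            pass                             # backslash-newline: line continuation
        elif esc in _SIMPLE_ESCAPES:
            out += _SIMPLE_ESCAPES[esc]
        else:
            out += esc.encode("latin-1")     # unknown escape: drop the backslash
        pos = m.end()
    out += raw[pos:].encode("latin-1")
    return bytes(out)


def decode_pdf_text(raw: str) -> str:
    """Decode an escaped PDF text string, honouring a UTF-16BE byte-order mark."""
    data = unescape_pdf_literal(raw)
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be", errors="replace")
    # Assumption: Latin-1 as an approximation of PDFDocEncoding.
    return data.decode("latin-1")


print(decode_pdf_text(r"\376\377\000T\000e\000t\000t\000i"))
```

Applied to the /Author value above, this yields the plain name rather than the 
escaped octets.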

When extracted, we get:
{noformat}
...
dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
meta:author: \376\377\000T\000e\000t\000t\000i
meta:author: Tetti
...
{noformat}

So the author is decoded correctly in one of its two values, but neither title 
value is decoded. Is the XMP dc:title being used to override the PDF title 
field? Or is one of the title fields being decoded incorrectly?

(I accept that although this is a real PDF document from the web, it is also a 
malformed one, so maybe there is not much to be done here.)
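
For anyone wanting to see which extracted fields are affected, a quick 
diagnostic along these lines might help. It is only a sketch: the helper names 
are hypothetical, and the dictionary simply repeats the meta:author values from 
the output above. It flags any value that still begins with the escaped 
UTF-16BE byte-order mark.

```python
# Literal text of an escaped UTF-16BE byte-order mark, as seen in the output.
ESCAPED_BOM = "\\376\\377"


def looks_undecoded(value: str) -> bool:
    """True if the value still starts with the escaped BOM octets."""
    return value.startswith(ESCAPED_BOM)


def find_undecoded(metadata: dict) -> list:
    """Return (key, value) pairs whose value appears not to have been decoded."""
    return [
        (key, value)
        for key, values in metadata.items()
        for value in values
        if looks_undecoded(value)
    ]


# Values copied from the extraction output in this report.
extracted = {
    "meta:author": [r"\376\377\000T\000e\000t\000t\000i", "Tetti"],
}

for key, value in find_undecoded(extracted):
    print(key, "still escaped:", value)
```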



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
