[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

Bob Paulin (JIRA) Fri, 24 Jul 2015 06:10:42 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640423#comment-14640423
 ]


Bob Paulin commented on TIKA-1678:
----------------------------------

Hi [[email protected]], before putting the Tika class in a different package, 
could you subclass the PDFBox BaseParser with a Tika class?  Protected methods 
are also accessible to subclasses, so this approach might be better than 
aligning the packages.  

If you do need a split package, there are directives that allow you to merge 
the two halves, and the ordering generally only matters when there may be 
class collisions.  But I think subclassing might remove the split package 
completely.
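
To illustrate the subclassing idea, here is a minimal sketch. LibraryParser, 
TikaSideParser, parseString and parseForTika are hypothetical names made up 
for this example; they are not the real PDFBox or Tika API. Both classes sit 
in one file here so the sketch runs on its own, but the same inherited call 
also compiles when the subclass lives in a different package.

```java
public class SubclassAccessSketch {

    // Hypothetical stand-in for a library base class with a protected helper.
    // This is NOT the real PDFBox BaseParser.
    static class LibraryParser {
        protected String parseString(String raw) {
            return raw.trim();
        }
    }

    // Hypothetical Tika-side subclass. It reaches the protected method by
    // inheritance rather than by sharing the library's package.
    static class TikaSideParser extends LibraryParser {
        public String parseForTika(String raw) {
            return parseString(raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(new TikaSideParser().parseForTika("  Tetti  "));
    }
}
```

The subclass exposes only the behaviour it needs through its own public 
method, so nothing on the Tika side has to be declared in the library's 
package.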

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
> 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
> xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)
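
For reference, the escaped values quoted above can be decoded roughly as 
follows. This is only a sketch and not the code Tika or PDFBox actually uses: 
the class and method names are made up for illustration, ISO-8859-1 stands in 
for PDFDocEncoding as a simplification, and backslash line continuations are 
not handled. It unescapes the octal sequences to bytes, then treats a leading 
FE FF (written \376\377) as a UTF-16BE byte order mark.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PdfTextStringSketch {

    /** Turns the escaped body of a PDF literal string into raw bytes. */
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\' || i == literal.length()) {
                out.write(c & 0xFF); // ordinary character (or a trailing backslash)
                continue;
            }
            char e = literal.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits give one byte.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length()
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    default:  out.write(e & 0xFF); break; // covers \( \) and \\
                }
            }
        }
        return out.toByteArray();
    }

    /** Decodes the bytes as UTF-16BE when they start with the FE FF marker. */
    static String decode(byte[] bytes) {
        if (bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 as a rough stand-in for PDFDocEncoding.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The /Author value from the report, with backslashes escaped for Java.
        String author = "\\376\\377\\000T\\000e\\000t\\000t\\000i";
        System.out.println(decode(unescape(author))); // prints "Tetti"
    }
}
```

Applied to the /Author value this yields "Tetti", which matches the one 
correctly decoded meta:author line in the extraction output.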



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
