[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

Tim Allison (JIRA) Mon, 20 Jul 2015 12:39:56 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633986#comment-14633986
 ]


Tim Allison edited comment on TIKA-1678 at 7/20/15 7:38 PM:
------------------------------------------------------------

That works perfectly.  Thank you, [~tilman]!

Now I'm trying to generate a test file with code lifted from the writing a PDFA 
example in PDFBox:
{noformat}
        PDDocument doc = new PDDocument();
        try {

            PDPage page = new PDPage();
            doc.addPage(page);
            PDFont font = PDType0Font.load(doc, 
this.getClass().getResourceAsStream("/test-documents/testTrueType3.ttf"));
            String message = "This file contains an octal encoded string in the 
title field of the xmp";
            // create a page with the message
            PDPageContentStream contents = new PDPageContentStream(doc, page);
            contents.beginText();
            contents.setFont(font, 12);
            contents.newLineAtOffset(100, 700);
            contents.showText(message);
            contents.endText();
            contents.saveGraphicsState();
            contents.close();

            // add XMP metadata
            XMPMetadata xmp = XMPMetadata.createXMPMetadata();
            try {

                DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
                dc.setTitle("this is the title");

                XmpSerializer serializer = new XmpSerializer();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                serializer.serialize(xmp, baos, true);
                System.out.println(StringUtil.fromStream(new 
ByteArrayInputStream(baos.toByteArray())));
                PDMetadata metadata = new PDMetadata(doc);
                metadata.importXMPMetadata(baos.toByteArray());
                doc.getDocumentCatalog().setMetadata(metadata);
            } catch (TransformerException e) {
                e.printStackTrace();
            }
            doc.save(outputPath.toFile());
        } finally {
            doc.close();
        }
{noformat}

The resulting file opens in Adobe and shows the text.  The "title" field is 
empty (not sure if Adobe reads only from the internal metadata instead of the 
xmp?), and I'm getting null when I try to read the title value from the XMP 
with PDFBox.  This is all 2.0.0 trunk.

What am I doing wrong?  Thank you, again.


was (Author: [email protected]):
That works perfectly.  Thank you, [~tilman]!

Now I'm trying to generate a test file with code lifted from the writing a PDFA 
example in PDFBox:
{noformat}
        PDDocument doc = new PDDocument();
        try {

            PDPage page = new PDPage();
            doc.addPage(page);
            PDFont font = PDType0Font.load(doc, 
this.getClass().getResourceAsStream("/test-documents/testTrueType3.ttf"));
            String message = "This file contains an octal encoded string in the 
title field of the xmp";
            // create a page with the message
            PDPageContentStream contents = new PDPageContentStream(doc, page);
            contents.beginText();
            contents.setFont(font, 12);
            contents.newLineAtOffset(100, 700);
            contents.showText(message);
            contents.endText();
            contents.saveGraphicsState();
            contents.close();

            // add XMP metadata
            XMPMetadata xmp = XMPMetadata.createXMPMetadata();
            try {

                DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
                dc.setTitle("this is the title");

                XmpSerializer serializer = new XmpSerializer();
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                serializer.serialize(xmp, baos, true);
                System.out.println(StringUtil.fromStream(new 
ByteArrayInputStream(baos.toByteArray())));
                PDMetadata metadata = new PDMetadata(doc);
                metadata.importXMPMetadata(baos.toByteArray());
                doc.getDocumentCatalog().setMetadata(metadata);
            } catch (TransformerException e) {
                e.printStackTrace();
            }
            doc.save(outputPath.toFile());
        } finally {
            doc.close();
        }
{noformat}

The resulting file opens in Adobe and shows the text.  The "title" field is 
empty (not sure if Adobe reads only from the internal metadata instead of the 
xmp?), and I'm getting null when I try to read the title value with PDFBox.  
This is all 2.0.0 trunk.

What am I doing wrong?  Thank you, again.

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-18 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
> 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
> xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

Reply via email to