Alexandre Garino created PDFBOX-1995:
----------------------------------------

             Summary: AdobePDFSchema.getProducer() returns empty string
                 Key: PDFBOX-1995
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1995
             Project: PDFBox
          Issue Type: Bug
          Components: XmpBox
    Affects Versions: 1.8.4
            Reporter: Alexandre Garino


I ran into this bug during PDF/A validation. The document is not 
considered valid because the producer value in the XMP metadata is not in 
sync with PDDocumentInformation.

{quote}
PDDocumentInformation.getProducer() = ` ' (one space)
AdobePDFSchema.getProducer() = `' (empty)
{quote}

Below is the metadata extracted from the PDF document:
 
{quote}
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="" xmlns:xap="http://ns.adobe.com/xap/1.0/">
            <xap:CreatorTool>Canon </xap:CreatorTool>
            <xap:CreateDate>2014-01-23T20:09:45+01:00</xap:CreateDate>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
            <pdf:Producer> </pdf:Producer>
        </rdf:Description>
        <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
            <pdfaid:part>1</pdfaid:part>
            <pdfaid:conformance>B</pdfaid:conformance>
        </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
{quote}

As you can see, the Producer value should be equal to ` ' (one space).

The bug is located in the method DomXmpParser.removeComments. This method 
is invoked during the unmarshalling process and removes much more than 
comments: it also removes every whitespace-only text node, including one 
that is the actual value of a property.
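
To make the effect easier to reproduce outside PDFBox, here is a minimal sketch using only the JDK DOM parser. The class name ProducerWhitespaceDemo, the helper removeBlankText and the reduced sample packet are assumptions made for illustration; this is not the PDFBox source, it only applies the same trim() test to a cut-down copy of the metadata above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

class ProducerWhitespaceDemo {

    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";

    // Reduced copy of the packet above (assumption: only the pdf schema matters here).
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
        + "  <rdf:Description rdf:about=\"\" xmlns:pdf=\"" + PDF_NS + "\">\n"
        + "    <pdf:Producer> </pdf:Producer>\n"
        + "  </rdf:Description>\n"
        + "</rdf:RDF>";

    // Hypothetical stand-in for the pruning step: drops every Text node
    // whose trimmed content is empty, wherever it sits in the tree.
    static void removeBlankText(Node parent) {
        NodeList children = parent.getChildNodes();
        // Walk backwards so removals do not shift the indexes still to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node node = children.item(i);
            if (node instanceof Text) {
                if (node.getTextContent().trim().length() == 0) {
                    parent.removeChild(node);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                removeBlankText(node);
            }
        }
    }

    // Parses the XML, prunes it, and returns the text of pdf:Producer.
    static String producerAfterPruning(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            removeBlankText(doc.getDocumentElement());
            Node producer = doc.getElementsByTagNameNS(PDF_NS, "Producer").item(0);
            return producer.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The single space inside pdf:Producer is gone after pruning.
        System.out.println("[" + producerAfterPruning(SAMPLE) + "]");
    }
}
```

The indentation between elements and the one-space Producer value both pass the same trim() test, so the sketch cannot tell them apart, which matches what I see in the validation result.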

I can work around MY issue (badly) by changing the code from:

{quote}
                Text t = (Text) node;
                if (t.getTextContent().trim().length() == 0)
                {
                    // XXX is there a better way to remove useless Text ?
                    node.getParentNode().removeChild(node);
                }
{quote}

to:

{quote}
                Text t = (Text) node;
                if (t.getTextContent().startsWith("\n"))
                {
                    // XXX is there a better way to remove useless Text ?
                    node.getParentNode().removeChild(node);
                }
{quote}

But this is not a long-term fix.

IMHO, the unmarshalling process should be reworked.
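
One possible direction, given only as a sketch and not as a proposed patch: drop a whitespace-only text node only when its parent also has element children (so it is indentation), and keep it when it is the sole content of a property element. The class StructuralWhitespaceFilter and its helpers are hypothetical names, and the rule itself is my assumption about what "useless" whitespace means here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

class StructuralWhitespaceFilter {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";

    // Reduced copy of the packet from this report (assumption: enough to show the rule).
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">\n"
        + "  <rdf:Description rdf:about=\"\" xmlns:pdf=\"" + PDF_NS + "\">\n"
        + "    <pdf:Producer> </pdf:Producer>\n"
        + "  </rdf:Description>\n"
        + "</rdf:RDF>";

    static boolean hasElementChild(Node parent) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Removes blank Text nodes only under parents that contain elements.
    static void removeIndentation(Node parent) {
        boolean structural = hasElementChild(parent);
        NodeList children = parent.getChildNodes();
        // Walk backwards so removals do not shift the indexes still to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node node = children.item(i);
            if (node instanceof Text) {
                if (structural && node.getTextContent().trim().length() == 0) {
                    parent.removeChild(node);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                removeIndentation(node);
            }
        }
    }

    // Parses the XML and applies the filter to the whole tree.
    static Document filter(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            removeIndentation(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String producerAfterFiltering(String xml) {
        return filter(xml).getElementsByTagNameNS(PDF_NS, "Producer").item(0).getTextContent();
    }

    static int descriptionChildCount(String xml) {
        return filter(xml).getElementsByTagNameNS(RDF_NS, "Description").item(0)
            .getChildNodes().getLength();
    }

    public static void main(String[] args) {
        System.out.println("[" + producerAfterFiltering(SAMPLE) + "]");
        System.out.println(descriptionChildCount(SAMPLE));
    }
}
```

With this rule the Producer value stays ` ' (one space) while the indentation inside rdf:Description is still removed. It has a known gap: a property written as an opening tag, a line break and a closing tag would keep the line break as its value, so this is only a starting point for the rework.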



--
This message was sent by Atlassian JIRA
(v6.2#6252)
