[jira] [Commented] (PDFBOX-5977) PDFA schema not detected

Tilman Hausherr (Jira) Fri, 04 Apr 2025 12:00:50 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937447#comment-17937447
 ]


Tilman Hausherr commented on PDFBOX-5977:
-----------------------------------------

One thing to do will be to add {{builderFactory.setNamespaceAware(true);}} to 
{{XMLUtil.parse()}}, because without it the namespace information is lost. 
Then some changes are needed in {{XMPMetadata.getSchemas()}} at line 645:
{code:java}
else if (attribute.getNamespaceURI() != null &&
         nsMappings.containsKey(attribute.getNamespaceURI()) &&
         name.contains(":"))
{
    Class<?> schemaClass = nsMappings.get(attribute.getNamespaceURI());
    try
    {
        String prefix = name.substring(0, name.indexOf(':'));
        Constructor<?> ctor = schemaClass
                .getDeclaredConstructor(new Class[] { Element.class,
                        String.class });
        retval.add((XMPSchema)ctor.newInstance(new Object[] { schema,
                prefix }));
        found = true;
    }
    catch(NoSuchMethodException e)
    {
        throw new IOException(
                "Error: Class "
                        + schemaClass.getName()
                        + " must have a constructor with the signature of "
                        + schemaClass.getName()
                        + "( org.w3c.dom.Element, java.lang.String )");
    }
    catch(Exception e)
    {
        e.printStackTrace();
        throw new IOException(e.getMessage());
    }
}
{code}
This would have to be refactored because the previous block is very similar, 
but I have left it as is for now to show what the change is about.
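
As a rough idea of what such a refactoring could look like, here is a minimal sketch that moves the reflective constructor call into one helper that both blocks could share. {{SchemaFactorySketch}}, {{instantiate}}, {{newElement}} and {{DemoSchema}} are hypothetical names used only for illustration; they are not part of jempbox, and {{DemoSchema}} merely stands in for an {{XMPSchema}} subclass.

```java
import java.io.IOException;
import java.lang.reflect.Constructor;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SchemaFactorySketch
{
    /** Hypothetical stand-in for an XMPSchema subclass (not a jempbox class). */
    public static class DemoSchema
    {
        final Element element;
        final String prefix;

        public DemoSchema(Element element, String prefix)
        {
            this.element = element;
            this.prefix = prefix;
        }
    }

    /**
     * Creates a schema object through its (Element, String) constructor.
     * Any failure is reported as an IOException, as in the block above.
     */
    static Object instantiate(Class<?> schemaClass, Element schema, String prefix)
            throws IOException
    {
        try
        {
            Constructor<?> ctor =
                    schemaClass.getDeclaredConstructor(Element.class, String.class);
            return ctor.newInstance(schema, prefix);
        }
        catch (NoSuchMethodException e)
        {
            throw new IOException("Error: Class " + schemaClass.getName()
                    + " must have a constructor with the signature of "
                    + schemaClass.getName()
                    + "( org.w3c.dom.Element, java.lang.String )");
        }
        catch (Exception e)
        {
            // Keep the original exception as the cause instead of printing it.
            throw new IOException(e.getMessage(), e);
        }
    }

    /** Builds a detached DOM element so the sketch can be run on its own. */
    static Element newElement(String name) throws Exception
    {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        return doc.createElement(name);
    }

    public static void main(String[] args) throws Exception
    {
        Object schema = instantiate(DemoSchema.class, newElement("Description"), "pdfaid");
        System.out.println(schema.getClass().getSimpleName());
    }
}
```

Both branches would then differ only in how they look up the class and the prefix before calling the helper. Passing the cause to {{IOException}} assumes Java 6 or later.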

There's also a bug in the existing code: the "schema" object is created 
several times instead of only once.
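
To illustrate why {{setNamespaceAware(true)}} matters, the following standalone sketch parses a reduced version of the XMP packet from the issue with the plain JDK DOM parser and reads the namespace URI of the {{pdfaid:part}} attribute. It does not use jempbox at all; the class name {{NamespaceAwareDemo}} and the method {{namespaceOfFirstAttribute}} are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NamespaceAwareDemo
{
    /**
     * Parses a small RDF snippet and returns the namespace URI of the only
     * attribute on the rdf:Description element.
     */
    static String namespaceOfFirstAttribute(boolean namespaceAware) throws Exception
    {
        String xml =
                "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">"
                + "<rdf:Description pdfaid:part=\"3\"/>"
                + "</rdf:RDF>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // No whitespace in the snippet, so the first child is rdf:Description.
        Element description = (Element) doc.getDocumentElement().getFirstChild();
        Attr attribute = (Attr) description.getAttributes().item(0);
        return attribute.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println("not namespace aware: " + namespaceOfFirstAttribute(false));
        System.out.println("namespace aware:     " + namespaceOfFirstAttribute(true));
    }
}
```

With the default factory setting the attribute has no namespace URI, so a lookup such as {{nsMappings.containsKey(attribute.getNamespaceURI())}} cannot match; with namespace awareness enabled the URI of the {{pdfaid}} namespace is available.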

> PDFA schema not detected
> ------------------------
>
>                 Key: PDFBOX-5977
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5977
>             Project: PDFBox
>          Issue Type: Bug
>          Components: JempBox
>    Affects Versions: 1.8.17
>            Reporter: Tilman Hausherr
>            Priority: Major
>
> {code:java}
>         String s = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n" +
> "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?><rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\" xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">\n" +
> " <rdf:Description pdfaid:conformance=\"B\" pdfaid:part=\"3\" rdf:about=\"\"/>\n" +
> " <rdf:Description pdf:Producer=\"WeasyPrint 64.1\" rdf:about=\"\"/>\n" +
> "</rdf:RDF><?xpacket end=\"r\"?>";
>         XMPMetadata xmp = XMPMetadata.load(new ByteArrayInputStream(s.getBytes()));
>         xmp.addXMLNSMapping(XMPSchemaPDFAId.NAMESPACE, XMPSchemaPDFAId.class);
>         XMPSchemaPDFAId schema = (XMPSchemaPDFAId) xmp.getSchemaByClass(XMPSchemaPDFAId.class);
>         System.out.println(schema.getConformance() + " " + schema.getPart());
> {code}
> This fails with an NPE because 
> {{xmp.getSchemaByClass(XMPSchemaPDFAId.class)}} returns null.
> While most PDFBox users use xmpbox, some may still use jempbox due to bugs, 
> especially our sister project Apache Tika.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
