Jochen Stärk created PDFBOX-5976:
------------------------------------

             Summary: DomXmpParser incorrectly expects namespaces on attribute level
                 Key: PDFBOX-5976
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5976
             Project: PDFBox
          Issue Type: Bug
          Components: XmpBox
    Affects Versions: 3.0.4 PDFBox
            Reporter: Jochen Stärk
         Attachments: AN-10005_v28_2025-03-19-2.pdf, AN-10005_v28_2025-03-19x-1.pdf

When trying to determine the PDF/A version like this:

{{PDDocument document = null;}}
{{try {}}
{{    document = Loader.loadPDF(new File("AN-10005_v28_2025-03-19.pdf"));}}
{{    PDDocumentCatalog catalog = document.getDocumentCatalog();}}
{{    PDMetadata metadata = catalog.getMetadata();}}
{{    DomXmpParser xmpParser = new DomXmpParser();}}
{{    XMPMetadata xmp = xmpParser.parse(metadata.createInputStream());}}
{{    PDFAIdentificationSchema pdfaSchema = xmp.getPDFAIdentificationSchema();}}
{{    if (pdfaSchema != null) {}}
{{        System.out.println("It's a PDF A-" + pdfaSchema.getPart());}}
{{    }}}
{{    document.close();}}
{{} catch (XmpParsingException e) {}}
{{    e.printStackTrace();}}
{{} catch (IOException e) {}}
{{    e.printStackTrace();}}
{{}}}

on the attached (and valid) PDF/A-3b file AN-10005_v28_2025-03-19-2.pdf, PDFBox incorrectly fails with

{{org.apache.xmpbox.xml.XmpParsingException: Schema is not set in this document : http://www.aiim.org/pdfa/ns/id/}}
{{ at org.apache.xmpbox.xml.DomXmpParser.checkPropertyDefinition(DomXmpParser.java:920)}}
{{ at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRootAttr(DomXmpParser.java:276)}}
{{ at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:247)}}
{{ at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)}}
{{ at de.usegroup.Main.main(Main.java:25)}}

After manipulating the metadata stream with iText RUPS from

{{<rdf:RDF xmlns:pdf="http://ns.adobe.com/pdf/1.3/" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="" pdfaid:part="3" pdfaid:conformance="B" /><rdf:Description rdf:about="" pdf:Producer="WeasyPrint 64.1" /></rdf:RDF>}}

to

{{ <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">}}
{{ <rdf:Description rdf:about=""}}
{{ xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"}}
{{ xmlns:pdf="http://ns.adobe.com/pdf/1.3/"}}
{{ xmlns:xmp="http://ns.adobe.com/xap/1.0/"}}
{{ pdfaid:conformance="B"}}
{{ pdfaid:part="3"}}
{{ pdf:Producer="WeasyPrint 64.1; modified using iText® Core 7.2.5 (AGPL version) ©2000-2023 iText Group NV"}}
{{ xmp:ModifyDate="2025-03-21T08:16:58+01:00"/>}}
{{ </rdf:RDF>}}

which puts the namespace declarations on the rdf:Description element (AN-10005_v28_2025-03-19x-1.pdf), it works.

The issue is: it should be sufficient to put the namespace declarations on the root element, rdf:RDF, i.e. the first example should also work.

When searching for similar issues I had the impression this may be similar to your issue #2219.
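To illustrate why the first packet should be accepted, here is a minimal sketch that uses only the JDK's namespace-aware DOM parser, not XmpBox. It parses the first XMP snippet from above and checks that the pdfaid prefix resolves on rdf:Description even though it is declared only on rdf:RDF. The class name NamespaceScopeDemo and its helper methods are hypothetical names chosen for this example; this is not how DomXmpParser is implemented and not a proposed patch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; it does not use or mirror any XmpBox code.
public class NamespaceScopeDemo {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // First XMP packet from the report: every prefix is declared on rdf:RDF only.
    static final String ROOT_DECLARED =
        "<rdf:RDF xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\""
        + " xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\""
        + " xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
        + "<rdf:Description rdf:about=\"\" pdfaid:part=\"3\" pdfaid:conformance=\"B\" />"
        + "<rdf:Description rdf:about=\"\" pdf:Producer=\"WeasyPrint 64.1\" />"
        + "</rdf:RDF>";

    // Parses the packet with namespace processing switched on and
    // returns the first rdf:Description element.
    static Element firstDescription(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            return (Element) descriptions.item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    // Namespace URI the DOM assigned to the given attribute, or null if absent.
    static String namespaceOfAttribute(String xml, String qualifiedName) {
        Attr attr = firstDescription(xml).getAttributeNode(qualifiedName);
        return attr == null ? null : attr.getNamespaceURI();
    }

    // True if rdf:Description itself carries an xmlns:<prefix> declaration.
    static boolean declaredOnDescription(String xml, String prefix) {
        return firstDescription(xml).hasAttribute("xmlns:" + prefix);
    }

    // Namespace URI in scope for the prefix at rdf:Description,
    // including declarations inherited from ancestor elements.
    static String prefixInScope(String xml, String prefix) {
        return firstDescription(xml).lookupNamespaceURI(prefix);
    }

    public static void main(String[] args) {
        System.out.println("declared on rdf:Description: "
            + declaredOnDescription(ROOT_DECLARED, "pdfaid"));
        System.out.println("in scope at rdf:Description: "
            + prefixInScope(ROOT_DECLARED, "pdfaid"));
        System.out.println("namespace of pdfaid:part:    "
            + namespaceOfAttribute(ROOT_DECLARED, "pdfaid:part"));
    }
}
```

The point of the sketch is that a declaration on an ancestor element stays in scope for all descendants, so looking for the xmlns attribute only among the attributes of rdf:Description misses a declaration that is perfectly valid.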