[ https://issues.apache.org/jira/browse/PDFBOX-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185135#comment-14185135 ]
Laurent Richard commented on PDFBOX-2419:
-----------------------------------------

By the way, here is the code we use to work around the problem until it is fixed in PDFBox:

{code}
import java.io.IOException;
import java.io.StringWriter;
import java.util.List;

import com.google.common.xml.XmlEscapers;
import lombok.Cleanup;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.fdf.FDFDocument;
import org.apache.pdfbox.pdmodel.fdf.FDFField;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

public String extractXFDF() {
    try {
        // @Cleanup (Lombok) closes the resource when it goes out of scope
        @Cleanup PDDocument pdf = PDDocument.load(pdfFileName);
        pdf.setAllSecurityToBeRemoved(true);
        PDAcroForm form = pdf.getDocumentCatalog().getAcroForm();
        if (form == null) {
            throw new Pdf2OpxException("PDF file contains no Acroform");
        }
        @Cleanup FDFDocument fdf = form.exportFDF();
        @SuppressWarnings("unchecked")
        List<FDFField> fields = fdf.getCatalog().getFDF().getFields();
        sanitize(fields); // cf https://issues.apache.org/jira/browse/PDFBOX-2419
        @Cleanup StringWriter writer = new StringWriter();
        fdf.saveXFDF(writer);
        return writer.toString();
    } catch (COSVisitorException e) {
        throw new Pdf2OpxException("exception while extracting XFDF", e);
    } catch (IOException e) {
        throw new Pdf2OpxException("exception while reading PDF", e);
    }
}

// Escapes every field value in place, then recurses into the child fields.
private void sanitize(List<FDFField> fields) throws IOException {
    if (fields != null) {
        for (FDFField field : fields) {
            Object value = field.getValue();
            // Fields without a value are skipped to avoid a NullPointerException
            if (value != null) {
                field.setValue(XmlEscapers.xmlContentEscaper().escape(value.toString()));
            }
            sanitize(field.getKids());
        }
    }
}
{code}

The interesting part is the sanitize method. We use Guava's Escapers, but it is simply a matter of replacing the three characters mentioned in the issue ('<', '>' and '&') with their XML-escaped equivalents.

It would be better to write each field value correctly in the first place rather than modifying the values recursively afterwards. That means org.apache.pdfbox.pdmodel.fdf.FDFField.writeXML could be adjusted (although a better approach would be to avoid building the XML from Strings directly).
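
For illustration, the replacement described above could look roughly like the sketch below when done without Guava. XmlContentEscape and escape are hypothetical names, not part of PDFBox or Guava. The sketch covers only the three characters mentioned and does not handle characters that are not allowed in XML at all.

```java
/** Hypothetical helper for illustration only; not part of PDFBox or Guava. */
final class XmlContentEscape {

    private XmlContentEscape() {
        // static helper, no instances
    }

    /**
     * Replaces '&', '<' and '>' with their XML entities.
     * A null input is treated as an empty string (an assumption of this sketch).
     */
    static String escape(String raw) {
        if (raw == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this helper, the call in sanitize would become field.setValue(XmlContentEscape.escape(value.toString())). Because the input is scanned once, character by character, an '&' that is already part of an entity gets escaped again, so the method should only be applied to raw, unescaped values.
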
I would have liked to suggest a patch, but I was unable to compile PDFBox (Maven couldn't resolve dependencies such as com.levigo.jbig2:levigo-jbig2-imageio:jar:1.6.3).

> XFDF export is not XML compliant
> --------------------------------
>
>                 Key: PDFBOX-2419
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2419
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm
>    Affects Versions: 1.8.7
>            Reporter: Laurent Richard
>              Labels: FDF
>             Fix For: 1.8.8
>
>         Attachments: SampleForm.pdf
>
>
> The XFDF content is written as a simple string instead of XML nodes.
> As a result, field values containing special characters (&, <, >, ...) are
> not escaped and the resulting XML is invalid.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)