[ 
https://issues.apache.org/jira/browse/PDFBOX-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185135#comment-14185135
 ] 

Laurent Richard commented on PDFBOX-2419:
-----------------------------------------

By the way, here is the code we use to work around the problem until it is 
fixed in PDFBox:
{code}
    public String extractXFDF() {
        try {
            @Cleanup
            PDDocument pdf = PDDocument.load(pdfFileName);
            pdf.setAllSecurityToBeRemoved(true);
            PDAcroForm form = pdf.getDocumentCatalog().getAcroForm();
            if (form == null) {
                throw new Pdf2OpxException("PDF file contains no Acroform");
            }
            @Cleanup
            FDFDocument fdf = form.exportFDF();
            @SuppressWarnings("unchecked")
            List<FDFField> fields = fdf.getCatalog().getFDF().getFields();
            // cf. https://issues.apache.org/jira/browse/PDFBOX-2419
            sanitize(fields);
            @Cleanup
            StringWriter writer = new StringWriter();
            fdf.saveXFDF(writer);
            return writer.toString();
        } catch (COSVisitorException e) {
            throw new Pdf2OpxException("exception while extracting XFDF", e);
        } catch (IOException e) {
            throw new Pdf2OpxException("exception while reading PDF", e);
        }
    }

    private void sanitize(List<FDFField> fields) throws IOException {
        if (fields != null) {
            for (FDFField field : fields) {
                
field.setValue(XmlEscapers.xmlContentEscaper().escape(field.getValue().toString()));
                sanitize(field.getKids());
            }
        }

    }
{code}
The interesting part is the sanitize method. We use Guava's escapers, but it is 
simply a matter of replacing the three characters mentioned ('<', '>' and '&') 
with their XML-escaped equivalents. It would be better to write each field 
value correctly in the first place than to modify the values recursively 
afterwards.
That means org.apache.pdfbox.pdmodel.fdf.FDFField.writeXML could be adjusted 
(although an even better approach would be to stop building the XML from plain 
Strings).
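For illustration only, here is a minimal sketch of that replacement without Guava. The class name XmlContentEscape and its escape method are hypothetical names made up for this example; they are not part of PDFBox or Guava, and this is not a proposed patch.

```java
// Hypothetical helper: escapes the three characters that break XML text content.
final class XmlContentEscape {
    private XmlContentEscape() {
    }

    /** Returns the value with '&', '<' and '>' replaced by XML entities; null stays null. */
    static String escape(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

In the workaround above, a call such as XmlContentEscape.escape(field.getValue().toString()) could stand in for the Guava call. Note that input which is already escaped gets escaped a second time, so it should be applied exactly once per value.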
I would have liked to suggest a patch, but I was unable to compile PDFBox 
(Maven could not resolve dependencies such as 
com.levigo.jbig2:levigo-jbig2-imageio:jar:1.6.3).

> XFDF export is not XML compliant
> --------------------------------
>
>                 Key: PDFBOX-2419
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2419
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm
>    Affects Versions: 1.8.7
>            Reporter: Laurent Richard
>              Labels: FDF
>             Fix For: 1.8.8
>
>         Attachments: SampleForm.pdf
>
>
> The XFDF content is written as a simple string instead of XML nodes.
> As a result, field values containing special characters (&, <, >, ...) are 
> not escaped and the resulting XML is invalid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
