Proposition of donation of a PDF/A validator to the PDFBox project

Guillaume Bailleul Thu, 09 Jun 2011 13:50:00 -0700

Hi,

Last year, some colleagues and I developed a PDF/A validator. The result of
our work is now distributed under the Apache License 2.0. We wrote it because
we could not find any open source validator. It now appears to work well. In
fact, it has been working for a while, but it was hard to find time for it.
We are now ready to donate it to Apache. Today this validator is called
PaDaF, because the name contains only PDF and A. The source repository is on
GitHub:
https://github.com/gba-awl/padaf


Let me now explain what PDF/A is, why I think the validator could be part of
PDFBox, and how we built it.

PDF/A is an ISO standard for long-term archiving of documents. It describes
how a PDF document must be built to ensure that it can be reproduced in an
unforeseeable future. Our tool checks the conformance of a document with this
specification. You can learn more about PDF/A on the competence center web
site:
http://www.pdfa.org
This web site offers a set of more than 200 invalid PDF/A files. A complete
PDF/A validator should find at least the expected error in each of these
documents.

PaDaF is mainly based on a stream parser (written with JavaCC) and PDFBox.
That is why I think it could be integrated into the PDFBox suite. The main
artifacts are two jars: preflight, the validator API, and xmpbox, an API for
XMP manipulation. A 'jar with dependencies' also exists and can be used from
the command line.

First we tried to use jempbox for the XMP metadata (an XML block in the PDF
that contains metadata). But last year, jempbox was too limited for our
needs. Because meeting them would have required changing the jempbox
interface, we decided to write xmpbox from scratch. Today xmpbox can read and
generate the XMP block of a PDF. We use it to generate metadata in PDF files.
It can be used on its own, without PaDaF.
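
To give an idea of what reading the XMP block involves, here is a minimal
sketch that uses only the JDK XML parser. It does not use the xmpbox API; the
class name PdfaIdSketch, the method readPdfaId and the SAMPLE packet are
hypothetical and exist only for this illustration. It reads the PDF/A
identification properties (pdfaid:part and pdfaid:conformance) that a PDF/A
file declares in its XMP metadata.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of PaDaF, xmpbox or PDFBox.
class PdfaIdSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";

    // Hand-written sample XMP block declaring PDF/A-1b.
    static final String SAMPLE =
          "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:pdfaid=\"" + PDFAID_NS + "\">"
        + "<pdfaid:part>1</pdfaid:part>"
        + "<pdfaid:conformance>B</pdfaid:conformance>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    /** Returns part + conformance (for example "1B"), or null if no part is declared. */
    static String readPdfaId(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to resolve the pdfaid: prefix
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                String part = property(description, "part");
                if (part != null) {
                    String conformance = property(description, "conformance");
                    return part + (conformance == null ? "" : conformance);
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("XMP block is not well-formed XML", e);
        }
    }

    // RDF allows a simple property as an attribute or as a child element.
    private static String property(Element description, String localName) {
        if (description.hasAttributeNS(PDFAID_NS, localName)) {
            return description.getAttributeNS(PDFAID_NS, localName).trim();
        }
        NodeList children = description.getElementsByTagNameNS(PDFAID_NS, localName);
        return children.getLength() > 0 ? children.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(readPdfaId(SAMPLE));
    }
}
```

A real metadata library has to do much more than this (schema validation,
value types, writing the packet back), which is the kind of work xmpbox is
meant to cover.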

The validation of the file is done with a JavaCC grammar, and PDFBox is used
to load objects when further checks are needed.
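
As an illustration of the kind of stream-level rule such a first pass can
check, here is a small hand-written sketch. It is not the PaDaF grammar and
does not use JavaCC or PDFBox; the class StreamCheckSketch and its check
method are hypothetical. It assumes three PDF/A-1 file structure rules: the
file starts with a %PDF-1.n header, the header is followed by a comment line
holding at least four bytes above 127, and nothing but an optional
end-of-line follows the last %%EOF marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; a stand-in for a real grammar-based parser.
class StreamCheckSketch {

    /** Returns one message per violated rule; an empty list means all three rules hold. */
    static List<String> check(byte[] pdf) {
        List<String> errors = new ArrayList<>();
        // ISO-8859-1 maps every byte to exactly one char, so offsets are preserved.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        // Rule 1: the first line is the header "%PDF-1.n".
        int i = 0;
        while (i < text.length() && text.charAt(i) != '\r' && text.charAt(i) != '\n') {
            i++;
        }
        if (!text.substring(0, i).matches("%PDF-1\\.\\d")) {
            errors.add("header is not %PDF-1.n");
        }

        // Skip the end-of-line marker (CR, LF or CR LF).
        if (i < text.length() && text.charAt(i) == '\r') i++;
        if (i < text.length() && text.charAt(i) == '\n') i++;

        // Rule 2: next comes '%' followed by at least four bytes above 127.
        boolean binaryComment = i + 4 < text.length() && text.charAt(i) == '%';
        for (int k = 1; binaryComment && k <= 4; k++) {
            binaryComment = text.charAt(i + k) > 127;
        }
        if (!binaryComment) {
            errors.add("missing binary comment after header");
        }

        // Rule 3: only a single optional end-of-line may follow the last %%EOF.
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            errors.add("missing %%EOF");
        } else if (!text.substring(eof + 5).matches("(\\r\\n|\\r|\\n)?")) {
            errors.add("data after last %%EOF");
        }
        return errors;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\nendobj\n%%EOF\njunk"
            .getBytes(StandardCharsets.ISO_8859_1);
        // This sample has no binary comment and has data after %%EOF.
        check(sample).forEach(System.out::println);
    }
}
```

Rules about fonts, color spaces or metadata cannot be decided from the byte
stream alone, which is where loading the objects through PDFBox comes in.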

In previous versions, we had to patch PDFBox to make all our tests pass.
Since all the patches we proposed have been included (mostly font-related),
we can now use the standard 1.5.0 release of PDFBox. It is also compatible
with the current head of PDFBox as I write this message.

So today we are ready to donate it and let it evolve with PDFBox. There is
still work to do on the code to make it comply with Apache rules. Let us know
whether this donation has its place in PDFBox and whether there are some
hands (and brains) to help us.

I know that this mail was quite long, and that its English was quite clumsy
(making it longer!), but maybe I forgot some piece of information or you have
some questions, so don't hesitate to ask.

Best regards,

Guillaume
