Hi,
On 09.06.2011 22:49, Guillaume Bailleul wrote:
Hi,
Last year, colleagues and I developed a PDF/A validator. The result of our
work is now distributed under the Apache License 2.0. We did it because we had
not found any open source validator. It now appears to work. In fact, it has
been working for a while, but it was hard to find time for it. We are now
ready to donate it to Apache. Today this validator is called PaDaF, because
the name contains only PDF and A. The source repository is on GitHub:
https://github.com/gba-awl/padaf
Thanks for your offer!
Let me now explain what PDF/A is, why I think it could be a part of PDFBox,
and how we made it.
PDF/A is an ISO standard for long-term archiving of documents. It describes
how a PDF document must be built to ensure it can be reproduced in the
unforeseeable future. Our tool checks the conformance of a document with this
specification. You can learn more about PDF/A on the competence center web
site:
http://www.pdfa.org
This web site offers a set of more than 200 invalid PDF/A files. A complete
PDF/A validator should find at least the error in each of these documents.
PaDaF is mainly based on a stream parser (built with JavaCC) and PDFBox.
That's why I think it could be integrated into the PDFBox suite. The main
artifacts are 2 jars: preflight, the validator API, and xmpbox, an API for
XMP manipulation. A 'jar with dependencies' also exists and can be used from
the command line.
At first we tried to use jempbox for the XMP metadata (an XML block in the
PDF that contains metadata). But last year, jempbox was too lightweight for
our needs. Because it would have been necessary to modify the interface of
jempbox, we decided to write xmpbox from scratch. Today xmpbox is able to read
or generate the XMP block of a PDF. We use it to generate metadata in PDF
files. It can be used on its own, without PaDaF.
The validation of the file is done with a JavaCC grammar, and PDFBox is used
to load objects when more checks must be done.
In previous versions, we had to patch PDFBox to make all our tests pass.
Since all the patches we proposed have been included (mostly font-related
fixes), we can now use the standard 1.5.0 version of PDFBox. It is also
compatible with the current head version of PDFBox as I write this message.
So today, we are ready to donate it and let it evolve with PDFBox. There is
work to do on the code to make it comply with Apache rules. Let us know if
this donation has its place in PDFBox and if there are some hands (and brains)
to help us.
Sounds like a new incubator candidate ... [1]
I know that this mail was quite long, and its English was quite clumsy
(making it longer!), but maybe I forgot some piece of information or you have
some questions, so don't hesitate to ask...
As PaDaF is too big to be added as a patch, we have to use the incubator to
include the project.
There are a lot of things to do. The most important are:
- a software grant is needed to donate the software to the ASF [2]
- all of you who want to participate in the "new" project (I hope all of you
will join) have to sign at least an ICLA [3]; a CCLA is probably needed as well
We have to prepare a proposal for the incubation candidate [4], which includes:
- formulate a proposal, e.g. like the PDFBox proposal [5]
- find a champion: I would volunteer
- find a sponsor: your intention is to integrate PaDaF into PDFBox. I agree with
that idea, but we have to ask the others whether they support it too, so
that Apache PDFBox would be the sponsor
- find some mentors: as a possible champion I would also volunteer as a mentor.
If PDFBox becomes the sponsor, there will hopefully be some others who join us
All the technical stuff can be discussed once the proposal has been accepted
as a new podling.
What do you think?
BR
Andreas Lehmkühler
[1] http://incubator.apache.org/
[2] http://www.apache.org/licenses/#grants
[3] http://www.apache.org/licenses/#clas
[4] http://incubator.apache.org/guides/proposal.html
[5] http://wiki.apache.org/incubator/PDFBoxProposal